Opened 22 years ago

Last modified 4 months ago

#46 assigned defect

zero byte cleanliness in C# and Java bindings

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 3.0.0
Component: Xapian-bindings Version: SVN trunk
Severity: minor Keywords:
Cc: Richard Boulton Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Current status:

Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0 in Java when returned; character >= U+10000 is misencoded as two surrogate pair codepoints encoded into UTF-8

Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when returned

C#: Truncates at \0 on input, which seems to be a SWIG or pinvoke limitation; character >= U+10000 seems to work (at least on Linux)


Original description:

Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library are harmless at 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). xapian-bindings hasn't been checked.

Attachments (1)

java-use-string-critical-twice-as-slow-though.patch (2.1 KB ) - added by Olly Betts 15 years ago.
Patch to do proper character set conversion for Java

Download all attachments as: .zip

Change History (26)

comment:1 by Olly Betts, 22 years ago

Severity: blockernormal
Status: newassigned

comment:2 by Olly Betts, 21 years ago

All c_str calls in xapian-core are ok at 2004-12-05

comment:3 by Olly Betts, 21 years ago

No calls to c_str in xapian-examples!

comment:4 by Olly Betts, 19 years ago

Component: otherOther
op_sys: otherAll
rep_platform: OtherAll
Version: otherSVN HEAD

Most of the bindings now have a simple zero-byte cleanliness test.

The notable exception is C# which doesn't seem to easily support zero byte clean string passing currently.

Tcl encodes zero bytes as the non-normalised utf-8 form "\xc0\x80" so it's not actually possible to pass a zero byte into Xapian from Tcl...

comment:5 by Olly Betts, 19 years ago

Component: OtherXapian-bindings

comment:6 by Olly Betts, 19 years ago

Priority: highnormal
Severity: normalminor
Summary: zero byte cleanlinesszero byte cleanliness in C# bindings

comment:7 by Richard Boulton, 19 years ago

Blocking: 118 added

This should at least be investigated for the 1.0 release, though maybe it's too hard to fix in all bindings right now.

comment:8 by Olly Betts, 19 years ago

Summary: zero byte cleanliness in C# bindingszero byte cleanliness in C# and Java bindings

It's already fixed for everything except C# (oh, hmm and Java too) - there are tests in smoketest.* for it.

Tcl is fixed by virtue of not being able to have literal zero bytes in strings as they're encoded as non-normalised utf-8. Strictly speaking that should be addressed by checking for and translating '\xc0\x80' <-> '\0' as it's not valid utf-8 under current rules.

I asked on the SWIG list about C# ages ago and William Fulton seemed to suggest it was hard to fix for C# - the pinvoke stuff is built around C strings it seems.

Java is probably best fixed by transitioning to SWIG rather than expending effort on doomed code, but I'd rather leave moving it to SWIG until after 1.0. I don't think it's a huge amount of work, but it is a fundamental change that's potentially disruptive.

comment:9 by Richard Boulton, 19 years ago

Cc: richard@… added

comment:10 by Richard Boulton, 19 years ago

Okay - since fixing this isn't going to break anything which already works, I think it's a prime candidate for postponing to after 1.0.

comment:11 by Olly Betts, 19 years ago

Blocking: 118 removed

For Tcl, it would change what went in the database if you sent a "tcl nul", but if we ignore "\xc0\x80" in what comes out, it'll still essentially work OK (and I don't expect a lot of people put zero bytes in Xapian databases from Tcl anyway).

So I agree (removed block).

comment:12 by Richard Boulton, 19 years ago

Blocking: 120 added
Operating System: All

comment:14 by Richard Boulton, 18 years ago

Description: modified (diff)
Milestone: 1.1

comment:15 by Richard Boulton, 18 years ago

Blocking: 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:16 by Olly Betts, 18 years ago

As of 1.0.7, passing strings from Java to C++ is now zero byte safe.

comment:17 by Olly Betts, 17 years ago

Milestone: 1.1.01.1.1

These changes aren't incompatible, so bumping to 1.1.1.

comment:18 by Olly Betts, 17 years ago

Milestone: 1.1.11.1.7

Triaging milestone:1.1.1 bugs.

comment:19 by Olly Betts, 17 years ago

Milestone: 1.1.71.2.0

This shouldn't require incompatible changes. While it would be better to be zero-byte clean, we've not had any actual user complaints or comments, so I'm going to bump this out of 1.1.x.

comment:20 by Olly Betts, 15 years ago

Description: modified (diff)

SWIG-based Java isn't zero byte clean, but looking at the JNI methods used, I don't think the hand-coded JNI actually was on input (despite my comment above about it being so in 1.0.7).

Updated description with current status.

comment:21 by Olly Betts, 15 years ago

Description: modified (diff)

I must have had an unclean tree or something. SWIG-based Java bindings are just like Tcl - from the Java side they appear zero-byte clean, but actually in C++ we see \xc0\x80 for Java \0.

I tried a quick patch to use GetStringCritical() and convert to UTF-8 ourselves using's Xapian's Unicode support, which would mean \0 in Java <-> \0 in C++, and also would convert surrogate pairs in Java's representation properly to/from UTF-8 in C++. However, timing some operations which do a lot of string passing this is twice as slow, so I'm parking it for now. I'll attach it here so it doesn't get lost.

Added a Java testcase in r15917, which just checks that the roundtripping works.

Last edited 10 years ago by Olly Betts (previous) (diff)

by Olly Betts, 15 years ago

Patch to do proper character set conversion for Java

comment:22 by Olly Betts, 14 years ago

Milestone: 1.2.x1.3.x

It would be good to look at these issues for 1.3.x, particularly seeing if the Java patch can be made to work faster.

comment:23 by Olly Betts, 11 years ago

Milestone: 1.3.x1.4.x

This isn't worth holding up 1.4.0 for.

comment:24 by Olly Betts, 3 years ago

Description: modified (diff)

Testing with current git master:

  • Java: Still passes UTF-8 containing \xc0\x80 and codepoints >= U+10000 are encoded as high and low surrogate pair codepoints in UTF-8, which is invalid
  • csharp: Still truncates at a zero byte; codepoints >= U+10000 seem to work (on linux at least)
  • Tcl: Still passes UTF-8 containing \xc0\x80

A note on testing this: Looking at doc.get_description() can reveal zero bytes being passed with an invalid encoding as we get Document(docid=0, data=a\xc0\x80b) for input a\0b.

The same trick didn't work for surrogate pairs because get_description() didn't escape them and they converted back to the same Java string we fed in. However this is really a bug - we should treat surrogate pair code units encoded as UTF-8 as invalid, like we already do for overlong sequences. [d390be27767642fae189b10d82e902d5cb2f30ed] implements this change, and this trick now works for testing the surrogate case too.

[b8509b41c920e36ecfdd8a1a0b95e73f5f738b66] adds testcases where we don't already have them (with crude XPASS if the cases we know to fail should change to actually pass).

comment:25 by Olly Betts, 2 years ago

Milestone: 1.4.x2.0.0

Postponing. Could likely be backported to a stable release once implemented.

For Java a possible approach is described here: https://github.com/swig/swig/issues/2597#issuecomment-1728981776

it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI:

comment:26 by Olly Betts, 4 months ago

Milestone: 2.0.03.0.0

Milestone renamed

Note: See TracTickets for help on using tickets.