Opened 20 years ago

Last modified 6 months ago

#46 assigned defect

zero byte cleanliness in C# and Java bindings

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Xapian-bindings Version: SVN trunk
Severity: minor Keywords:
Cc: Richard Boulton Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Current status:

Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0 in Java when returned; character >= U+10000 is misencoded as two surrogate pair codepoints encoded into UTF-8

Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when returned

C#: Truncates at \0 on input, which seems to be a SWIG or pinvoke limitation; character >= U+10000 seems to work (at least on Linux)

Original description:

Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library are harmless at 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). xapian-bindings hasn't been checked.

Attachments (1)

java-use-string-critical-twice-as-slow-though.patch (2.1 KB ) - added by Olly Betts 13 years ago.
Patch to do proper character set conversion for Java

Download all attachments as: .zip

Change History (25)

comment:1 by Olly Betts, 20 years ago

Severity: blockernormal
Status: newassigned

comment:2 by Olly Betts, 20 years ago

All c_str calls in xapian-core are ok at 2004-12-05

comment:3 by Olly Betts, 20 years ago

No calls to c_str in xapian-examples!

comment:4 by Olly Betts, 18 years ago

Component: otherOther
op_sys: otherAll
rep_platform: OtherAll
Version: otherSVN HEAD

Most of the bindings now have a simple zero-byte cleanliness test.

The notable exception is C# which doesn't seem to easily support zero byte clean string passing currently.

Tcl encodes zero bytes as the non-normalised utf-8 form "\xc0\x80" so it's not actually possible to pass a zero byte into Xapian from Tcl...

comment:5 by Olly Betts, 17 years ago

Component: OtherXapian-bindings

comment:6 by Olly Betts, 17 years ago

Priority: highnormal
Severity: normalminor
Summary: zero byte cleanlinesszero byte cleanliness in C# bindings

comment:7 by Richard Boulton, 17 years ago

Blocking: 118 added

This should at least be investigated for the 1.0 release, though maybe it's too hard to fix in all bindings right now.

comment:8 by Olly Betts, 17 years ago

Summary: zero byte cleanliness in C# bindingszero byte cleanliness in C# and Java bindings

It's already fixed for everything except C# (oh, hmm and Java too) - there are tests in smoketest.* for it.

Tcl is fixed by virtue of not being able to have literal zero bytes in strings as they're encoded as non-normalised utf-8. Strictly speaking that should be addressed by checking for and translating '\xc0\x80' <-> '\0' as it's not valid utf-8 under current rules.

I asked on the SWIG list about C# ages ago and William Fulton seemed to suggest it was hard to fix for C# - the pinvoke stuff is built around C strings it seems.

Java is probably best fixed by transitioning to SWIG rather than expending effort on doomed code, but I'd rather leave moving it to SWIG until after 1.0. I don't think it's a huge amount of work, but it is a fundamental change that's potentially disruptive.

comment:9 by Richard Boulton, 17 years ago

Cc: richard@… added

comment:10 by Richard Boulton, 17 years ago

Okay - since fixing this isn't going to break anything which already works, I think it's a prime candidate for postponing to after 1.0.

comment:11 by Olly Betts, 17 years ago

Blocking: 118 removed

For Tcl, it would change what went in the database if you sent a "tcl nul", but if we ignore "\xc0\x80" in what comes out, it'll still essentially work OK (and I don't expect a lot of people put zero bytes in Xapian databases from Tcl anyway).

So I agree (removed block).

comment:12 by Richard Boulton, 17 years ago

Blocking: 120 added
Operating System: All

comment:14 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.1

comment:15 by Richard Boulton, 16 years ago

Blocking: 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:16 by Olly Betts, 16 years ago

As of 1.0.7, passing strings from Java to C++ is now zero byte safe.

comment:17 by Olly Betts, 16 years ago


These changes aren't incompatible, so bumping to 1.1.1.

comment:18 by Olly Betts, 15 years ago


Triaging milestone:1.1.1 bugs.

comment:19 by Olly Betts, 15 years ago


This shouldn't require incompatible changes. While it would be better to be zero-byte clean, we've not had any actual user complaints or comments, so I'm going to bump this out of 1.1.x.

comment:20 by Olly Betts, 13 years ago

Description: modified (diff)

SWIG-based Java isn't zero byte clean, but looking at the JNI methods used, I don't think the hand-coded JNI actually was on input (despite my comment above about it being so in 1.0.7).

Updated description with current status.

comment:21 by Olly Betts, 13 years ago

Description: modified (diff)

I must have had an unclean tree or something. SWIG-based Java bindings are just like Tcl - from the Java side they appear zero-byte clean, but actually in C++ we see \xc0\x80 for Java \0.

I tried a quick patch to use GetStringCritical() and convert to UTF-8 ourselves using's Xapian's Unicode support, which would mean \0 in Java <-> \0 in C++, and also would convert surrogate pairs in Java's representation properly to/from UTF-8 in C++. However, timing some operations which do a lot of string passing this is twice as slow, so I'm parking it for now. I'll attach it here so it doesn't get lost.

Added a Java testcase in r15917, which just checks that the roundtripping works.

Last edited 8 years ago by Olly Betts (previous) (diff)

by Olly Betts, 13 years ago

Patch to do proper character set conversion for Java

comment:22 by Olly Betts, 12 years ago

Milestone: 1.2.x1.3.x

It would be good to look at these issues for 1.3.x, particularly seeing if the Java patch can be made to work faster.

comment:23 by Olly Betts, 9 years ago

Milestone: 1.3.x1.4.x

This isn't worth holding up 1.4.0 for.

comment:24 by Olly Betts, 15 months ago

Description: modified (diff)

Testing with current git master:

  • Java: Still passes UTF-8 containing \xc0\x80 and codepoints >= U+10000 are encoded as high and low surrogate pair codepoints in UTF-8, which is invalid
  • csharp: Still truncates at a zero byte; codepoints >= U+10000 seem to work (on linux at least)
  • Tcl: Still passes UTF-8 containing \xc0\x80

A note on testing this: Looking at doc.get_description() can reveal zero bytes being passed with an invalid encoding as we get Document(docid=0, data=a\xc0\x80b) for input a\0b.

The same trick didn't work for surrogate pairs because get_description() didn't escape them and they converted back to the same Java string we fed in. However this is really a bug - we should treat surrogate pair code units encoded as UTF-8 as invalid, like we already do for overlong sequences. [d390be27767642fae189b10d82e902d5cb2f30ed] implements this change, and this trick now works for testing the surrogate case too.

[b8509b41c920e36ecfdd8a1a0b95e73f5f738b66] adds testcases where we don't already have them (with crude XPASS if the cases we know to fail should change to actually pass).

comment:25 by Olly Betts, 6 months ago

Milestone: 1.4.x2.0.0

Postponing. Could likely be backported to a stable release once implemented.

For Java a possible approach is described here:

it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI:

Note: See TracTickets for help on using tickets.