Context Navigation

← Previous Ticket
Next Ticket →

#46 assigned defect

zero byte cleanliness in C# and Java bindings

Reported by:	Olly Betts	Owned by:	Olly Betts
Priority:	normal	Milestone:	2.0.0
Component:	Xapian-bindings	Version:	SVN trunk
Severity:	minor	Keywords:
Cc:	Richard Boulton	Blocked By:
Blocking:		Operating System:	All

Description (last modified by Olly Betts)

Current status:

Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0 in Java when returned; character >= U+10000 is misencoded as two surrogate pair codepoints encoded into UTF-8

Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when returned

C#: Truncates at \0 on input, which seems to be a SWIG or pinvoke limitation; character >= U+10000 seems to work (at least on Linux)

Original description:

Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library are harmless at 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). xapian-bindings hasn't been checked.

Attachments (1)

java-use-string-critical-twice-as-slow-though.patch (2.1 KB ) - added by Olly Betts 14 years ago.: Patch to do proper character set conversion for Java

Download all attachments as: .zip

Change History (25)

comment:1 by Olly Betts, 21 years ago

Severity:	blocker → normal
Status:	new → assigned

comment:2 by Olly Betts, 21 years ago

All c_str calls in xapian-core are ok at 2004-12-05

comment:3 by Olly Betts, 21 years ago

No calls to c_str in xapian-examples!

comment:4 by Olly Betts, 19 years ago

Component:	other → Other
op_sys:	other → All
rep_platform:	Other → All
Version:	other → SVN HEAD

Most of the bindings now have a simple zero-byte cleanliness test.

The notable exception is C# which doesn't seem to easily support zero byte clean string passing currently.

Tcl encodes zero bytes as the non-normalised utf-8 form "\xc0\x80" so it's not actually possible to pass a zero byte into Xapian from Tcl...

comment:5 by Olly Betts, 18 years ago

Component:	Other → Xapian-bindings

comment:6 by Olly Betts, 18 years ago

Priority:	high → normal
Severity:	normal → minor
Summary:	zero byte cleanliness → zero byte cleanliness in C# bindings

comment:7 by Richard Boulton, 18 years ago

Blocking:	118 added

This should at least be investigated for the 1.0 release, though maybe it's too hard to fix in all bindings right now.

comment:8 by Olly Betts, 18 years ago

Summary:	zero byte cleanliness in C# bindings → zero byte cleanliness in C# and Java bindings

It's already fixed for everything except C# (oh, hmm and Java too) - there are tests in smoketest.* for it.

Tcl is fixed by virtue of not being able to have literal zero bytes in strings as they're encoded as non-normalised utf-8. Strictly speaking that should be addressed by checking for and translating '\xc0\x80' <-> '\0' as it's not valid utf-8 under current rules.

I asked on the SWIG list about C# ages ago and William Fulton seemed to suggest it was hard to fix for C# - the pinvoke stuff is built around C strings it seems.

Java is probably best fixed by transitioning to SWIG rather than expending effort on doomed code, but I'd rather leave moving it to SWIG until after 1.0. I don't think it's a huge amount of work, but it is a fundamental change that's potentially disruptive.

comment:9 by Richard Boulton, 18 years ago

Cc:	richard@… added

comment:10 by Richard Boulton, 18 years ago

Okay - since fixing this isn't going to break anything which already works, I think it's a prime candidate for postponing to after 1.0.

comment:11 by Olly Betts, 18 years ago

Blocking:	118 removed

For Tcl, it would change what went in the database if you sent a "tcl nul", but if we ignore "\xc0\x80" in what comes out, it'll still essentially work OK (and I don't expect a lot of people put zero bytes in Xapian databases from Tcl anyway).

So I agree (removed block).

comment:12 by Richard Boulton, 18 years ago

Blocking:	120 added
Operating System:	→ All

comment:14 by Richard Boulton, 17 years ago

Description:	modified (diff)
Milestone:	→ 1.1

comment:15 by Richard Boulton, 17 years ago

Blocking:	120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:16 by Olly Betts, 17 years ago

As of 1.0.7, passing strings from Java to C++ is now zero byte safe.

comment:17 by Olly Betts, 17 years ago

Milestone:	1.1.0 → 1.1.1

These changes aren't incompatible, so bumping to 1.1.1.

comment:18 by Olly Betts, 16 years ago

Milestone:	1.1.1 → 1.1.7

Triaging milestone:1.1.1 bugs.

comment:19 by Olly Betts, 16 years ago

Milestone:	1.1.7 → 1.2.0

This shouldn't require incompatible changes. While it would be better to be zero-byte clean, we've not had any actual user complaints or comments, so I'm going to bump this out of 1.1.x.

comment:20 by Olly Betts, 14 years ago

Description:	modified (diff)

SWIG-based Java isn't zero byte clean, but looking at the JNI methods used, I don't think the hand-coded JNI actually was on input (despite my comment above about it being so in 1.0.7).

Updated description with current status.

comment:21 by Olly Betts, 14 years ago

Description:	modified (diff)

I must have had an unclean tree or something. SWIG-based Java bindings are just like Tcl - from the Java side they appear zero-byte clean, but actually in C++ we see \xc0\x80 for Java \0.

I tried a quick patch to use GetStringCritical() and convert to UTF-8 ourselves using's Xapian's Unicode support, which would mean \0 in Java <-> \0 in C++, and also would convert surrogate pairs in Java's representation properly to/from UTF-8 in C++. However, timing some operations which do a lot of string passing this is twice as slow, so I'm parking it for now. I'll attach it here so it doesn't get lost.

Added a Java testcase in r15917, which just checks that the roundtripping works.

Last edited 9 years ago by Olly Betts (previous) (diff)

by Olly Betts, 14 years ago

Attachment:	java-use-string-critical-twice-as-slow-though.patch added

Patch to do proper character set conversion for Java

comment:22 by Olly Betts, 14 years ago

Milestone:	1.2.x → 1.3.x

It would be good to look at these issues for 1.3.x, particularly seeing if the Java patch can be made to work faster.

comment:23 by Olly Betts, 10 years ago

Milestone:	1.3.x → 1.4.x

This isn't worth holding up 1.4.0 for.

comment:24 by Olly Betts, 2 years ago

Description:	modified (diff)

Testing with current git master:

Java: Still passes UTF-8 containing \xc0\x80 and codepoints >= U+10000 are encoded as high and low surrogate pair codepoints in UTF-8, which is invalid
csharp: Still truncates at a zero byte; codepoints >= U+10000 seem to work (on linux at least)
Tcl: Still passes UTF-8 containing \xc0\x80

A note on testing this: Looking at doc.get_description() can reveal zero bytes being passed with an invalid encoding as we get Document(docid=0, data=a\xc0\x80b) for input a\0b.

The same trick didn't work for surrogate pairs because get_description() didn't escape them and they converted back to the same Java string we fed in. However this is really a bug - we should treat surrogate pair code units encoded as UTF-8 as invalid, like we already do for overlong sequences. [d390be27767642fae189b10d82e902d5cb2f30ed] implements this change, and this trick now works for testing the surrogate case too.

[b8509b41c920e36ecfdd8a1a0b95e73f5f738b66] adds testcases where we don't already have them (with crude XPASS if the cases we know to fail should change to actually pass).

comment:25 by Olly Betts, 20 months ago

Milestone:	1.4.x → 2.0.0

Postponing. Could likely be backported to a stable release once implemented.

For Java a possible approach is described here: https://github.com/swig/swig/issues/2597#issuecomment-1728981776

it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI:

Note: See TracTickets for help on using tickets.

Download in other formats: