Opened 21 years ago
Last modified 2 years ago
#46 assigned defect
zero byte cleanliness in C# and Java bindings
| Reported by: | Olly Betts | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 2.0.0 |
| Component: | Xapian-bindings | Version: | SVN trunk |
| Severity: | minor | Keywords: | |
| Cc: | Richard Boulton | Blocked By: | |
| Blocking: | Operating System: | All |
Description (last modified by )
Current status:
Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0 in Java when returned; character >= U+10000 is misencoded as two surrogate pair codepoints encoded into UTF-8
Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when returned
C#: Truncates at \0 on input, which seems to be a SWIG or pinvoke limitation; character >= U+10000 seems to work (at least on Linux)
Original description:
Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library are harmless at 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). xapian-bindings hasn't been checked.
Attachments (1)
Change History (25)
comment:1 by , 21 years ago
| Severity: | blocker → normal |
|---|---|
| Status: | new → assigned |
comment:2 by , 21 years ago
comment:4 by , 19 years ago
| Component: | other → Other |
|---|---|
| op_sys: | other → All |
| rep_platform: | Other → All |
| Version: | other → SVN HEAD |
Most of the bindings now have a simple zero-byte cleanliness test.
The notable exception is C# which doesn't seem to easily support zero byte clean string passing currently.
Tcl encodes zero bytes as the non-normalised utf-8 form "\xc0\x80" so it's not actually possible to pass a zero byte into Xapian from Tcl...
comment:5 by , 19 years ago
| Component: | Other → Xapian-bindings |
|---|
comment:6 by , 19 years ago
| Priority: | high → normal |
|---|---|
| Severity: | normal → minor |
| Summary: | zero byte cleanliness → zero byte cleanliness in C# bindings |
comment:7 by , 19 years ago
| Blocking: | 118 added |
|---|
This should at least be investigated for the 1.0 release, though maybe it's too hard to fix in all bindings right now.
comment:8 by , 19 years ago
| Summary: | zero byte cleanliness in C# bindings → zero byte cleanliness in C# and Java bindings |
|---|
It's already fixed for everything except C# (oh, hmm and Java too) - there are tests in smoketest.* for it.
Tcl is fixed by virtue of not being able to have literal zero bytes in strings as they're encoded as non-normalised utf-8. Strictly speaking that should be addressed by checking for and translating '\xc0\x80' <-> '\0' as it's not valid utf-8 under current rules.
I asked on the SWIG list about C# ages ago and William Fulton seemed to suggest it was hard to fix for C# - the pinvoke stuff is built around C strings it seems.
Java is probably best fixed by transitioning to SWIG rather than expending effort on doomed code, but I'd rather leave moving it to SWIG until after 1.0. I don't think it's a huge amount of work, but it is a fundamental change that's potentially disruptive.
comment:9 by , 19 years ago
| Cc: | added |
|---|
comment:10 by , 19 years ago
Okay - since fixing this isn't going to break anything which already works, I think it's a prime candidate for postponing to after 1.0.
comment:11 by , 19 years ago
| Blocking: | 118 removed |
|---|
For Tcl, it would change what went in the database if you sent a "tcl nul", but if we ignore "\xc0\x80" in what comes out, it'll still essentially work OK (and I don't expect a lot of people put zero bytes in Xapian databases from Tcl anyway).
So I agree (removed block).
comment:12 by , 19 years ago
| Blocking: | 120 added |
|---|---|
| Operating System: | → All |
comment:14 by , 18 years ago
| Description: | modified (diff) |
|---|---|
| Milestone: | → 1.1 |
comment:15 by , 18 years ago
| Blocking: | 120 removed |
|---|
(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.
comment:17 by , 17 years ago
| Milestone: | 1.1.0 → 1.1.1 |
|---|
These changes aren't incompatible, so bumping to 1.1.1.
comment:19 by , 16 years ago
| Milestone: | 1.1.7 → 1.2.0 |
|---|
This shouldn't require incompatible changes. While it would be better to be zero-byte clean, we've not had any actual user complaints or comments, so I'm going to bump this out of 1.1.x.
comment:20 by , 14 years ago
| Description: | modified (diff) |
|---|
SWIG-based Java isn't zero byte clean, but looking at the JNI methods used, I don't think the hand-coded JNI actually was on input (despite my comment above about it being so in 1.0.7).
Updated description with current status.
comment:21 by , 14 years ago
| Description: | modified (diff) |
|---|
I must have had an unclean tree or something. SWIG-based Java bindings are just like Tcl - from the Java side they appear zero-byte clean, but actually in C++ we see \xc0\x80 for Java \0.
I tried a quick patch to use GetStringCritical() and convert to UTF-8 ourselves using's Xapian's Unicode support, which would mean \0 in Java <-> \0 in C++, and also would convert surrogate pairs in Java's representation properly to/from UTF-8 in C++. However, timing some operations which do a lot of string passing this is twice as slow, so I'm parking it for now. I'll attach it here so it doesn't get lost.
Added a Java testcase in r15917, which just checks that the roundtripping works.
by , 14 years ago
| Attachment: | java-use-string-critical-twice-as-slow-though.patch added |
|---|
Patch to do proper character set conversion for Java
comment:22 by , 14 years ago
| Milestone: | 1.2.x → 1.3.x |
|---|
It would be good to look at these issues for 1.3.x, particularly seeing if the Java patch can be made to work faster.
comment:24 by , 3 years ago
| Description: | modified (diff) |
|---|
Testing with current git master:
- Java: Still passes UTF-8 containing
\xc0\x80and codepoints >= U+10000 are encoded as high and low surrogate pair codepoints in UTF-8, which is invalid - csharp: Still truncates at a zero byte; codepoints >= U+10000 seem to work (on linux at least)
- Tcl: Still passes UTF-8 containing
\xc0\x80
A note on testing this: Looking at doc.get_description() can reveal zero bytes being passed with an invalid encoding as we get Document(docid=0, data=a\xc0\x80b) for input a\0b.
The same trick didn't work for surrogate pairs because get_description() didn't escape them and they converted back to the same Java string we fed in. However this is really a bug - we should treat surrogate pair code units encoded as UTF-8 as invalid, like we already do for overlong sequences. [d390be27767642fae189b10d82e902d5cb2f30ed] implements this change, and this trick now works for testing the surrogate case too.
[b8509b41c920e36ecfdd8a1a0b95e73f5f738b66] adds testcases where we don't already have them (with crude XPASS if the cases we know to fail should change to actually pass).
comment:25 by , 2 years ago
| Milestone: | 1.4.x → 2.0.0 |
|---|
Postponing. Could likely be backported to a stable release once implemented.
For Java a possible approach is described here: https://github.com/swig/swig/issues/2597#issuecomment-1728981776
it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI:

All c_str calls in xapian-core are ok at 2004-12-05