Opened 20 years ago
Last modified 13 months ago
#46 assigned defect
zero byte cleanliness in C# and Java bindings
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Xapian-bindings | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Richard Boulton | Blocked By: | |
Blocking: | Operating System: | All |
Description (last modified by )
Current status:
Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0 in Java when returned; character >= U+10000 is misencoded as two surrogate pair codepoints encoded into UTF-8
Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when returned
C#: Truncates at \0 on input, which seems to be a SWIG or pinvoke limitation; character >= U+10000 seems to work (at least on Linux)
Original description:
Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library are harmless at 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). xapian-bindings hasn't been checked.
Attachments (1)
Change History (25)
comment:1 by , 20 years ago
Severity: | blocker → normal |
---|---|
Status: | new → assigned |
comment:2 by , 20 years ago
comment:4 by , 18 years ago
Component: | other → Other |
---|---|
op_sys: | other → All |
rep_platform: | Other → All |
Version: | other → SVN HEAD |
Most of the bindings now have a simple zero-byte cleanliness test.
The notable exception is C# which doesn't seem to easily support zero byte clean string passing currently.
Tcl encodes zero bytes as the non-normalised utf-8 form "\xc0\x80" so it's not actually possible to pass a zero byte into Xapian from Tcl...
comment:5 by , 18 years ago
Component: | Other → Xapian-bindings |
---|
comment:6 by , 18 years ago
Priority: | high → normal |
---|---|
Severity: | normal → minor |
Summary: | zero byte cleanliness → zero byte cleanliness in C# bindings |
comment:7 by , 18 years ago
Blocking: | 118 added |
---|
This should at least be investigated for the 1.0 release, though maybe it's too hard to fix in all bindings right now.
comment:8 by , 18 years ago
Summary: | zero byte cleanliness in C# bindings → zero byte cleanliness in C# and Java bindings |
---|
It's already fixed for everything except C# (oh, hmm and Java too) - there are tests in smoketest.* for it.
Tcl is fixed by virtue of not being able to have literal zero bytes in strings as they're encoded as non-normalised utf-8. Strictly speaking that should be addressed by checking for and translating '\xc0\x80' <-> '\0' as it's not valid utf-8 under current rules.
I asked on the SWIG list about C# ages ago and William Fulton seemed to suggest it was hard to fix for C# - the pinvoke stuff is built around C strings it seems.
Java is probably best fixed by transitioning to SWIG rather than expending effort on doomed code, but I'd rather leave moving it to SWIG until after 1.0. I don't think it's a huge amount of work, but it is a fundamental change that's potentially disruptive.
comment:9 by , 18 years ago
Cc: | added |
---|
comment:10 by , 18 years ago
Okay - since fixing this isn't going to break anything which already works, I think it's a prime candidate for postponing to after 1.0.
comment:11 by , 18 years ago
Blocking: | 118 removed |
---|
For Tcl, it would change what went in the database if you sent a "tcl nul", but if we ignore "\xc0\x80" in what comes out, it'll still essentially work OK (and I don't expect a lot of people put zero bytes in Xapian databases from Tcl anyway).
So I agree (removed block).
comment:12 by , 18 years ago
Blocking: | 120 added |
---|---|
Operating System: | → All |
comment:14 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.1 |
comment:15 by , 17 years ago
Blocking: | 120 removed |
---|
(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.
comment:17 by , 16 years ago
Milestone: | 1.1.0 → 1.1.1 |
---|
These changes aren't incompatible, so bumping to 1.1.1.
comment:19 by , 15 years ago
Milestone: | 1.1.7 → 1.2.0 |
---|
This shouldn't require incompatible changes. While it would be better to be zero-byte clean, we've not had any actual user complaints or comments, so I'm going to bump this out of 1.1.x.
comment:20 by , 13 years ago
Description: | modified (diff) |
---|
SWIG-based Java isn't zero byte clean, but looking at the JNI methods used, I don't think the hand-coded JNI actually was on input (despite my comment above about it being so in 1.0.7).
Updated description with current status.
comment:21 by , 13 years ago
Description: | modified (diff) |
---|
I must have had an unclean tree or something. SWIG-based Java bindings are just like Tcl - from the Java side they appear zero-byte clean, but actually in C++ we see \xc0\x80 for Java \0.
I tried a quick patch to use GetStringCritical() and convert to UTF-8 ourselves using's Xapian's Unicode support, which would mean \0 in Java <-> \0 in C++, and also would convert surrogate pairs in Java's representation properly to/from UTF-8 in C++. However, timing some operations which do a lot of string passing this is twice as slow, so I'm parking it for now. I'll attach it here so it doesn't get lost.
Added a Java testcase in r15917, which just checks that the roundtripping works.
by , 13 years ago
Attachment: | java-use-string-critical-twice-as-slow-though.patch added |
---|
Patch to do proper character set conversion for Java
comment:22 by , 13 years ago
Milestone: | 1.2.x → 1.3.x |
---|
It would be good to look at these issues for 1.3.x, particularly seeing if the Java patch can be made to work faster.
comment:24 by , 21 months ago
Description: | modified (diff) |
---|
Testing with current git master:
- Java: Still passes UTF-8 containing
\xc0\x80
and codepoints >= U+10000 are encoded as high and low surrogate pair codepoints in UTF-8, which is invalid - csharp: Still truncates at a zero byte; codepoints >= U+10000 seem to work (on linux at least)
- Tcl: Still passes UTF-8 containing
\xc0\x80
A note on testing this: Looking at doc.get_description()
can reveal zero bytes being passed with an invalid encoding as we get Document(docid=0, data=a\xc0\x80b)
for input a\0b
.
The same trick didn't work for surrogate pairs because get_description()
didn't escape them and they converted back to the same Java string we fed in. However this is really a bug - we should treat surrogate pair code units encoded as UTF-8 as invalid, like we already do for overlong sequences. [d390be27767642fae189b10d82e902d5cb2f30ed] implements this change, and this trick now works for testing the surrogate case too.
[b8509b41c920e36ecfdd8a1a0b95e73f5f738b66] adds testcases where we don't already have them (with crude XPASS if the cases we know to fail should change to actually pass).
comment:25 by , 13 months ago
Milestone: | 1.4.x → 2.0.0 |
---|
Postponing. Could likely be backported to a stable release once implemented.
For Java a possible approach is described here: https://github.com/swig/swig/issues/2597#issuecomment-1728981776
it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI:
All c_str calls in xapian-core are ok at 2004-12-05