Opened 14 years ago

Last modified 2 years ago

#46 assigned defect

zero byte cleanliness in C# and Java bindings

Reported by: olly Owned by: olly
Priority: normal Milestone: 1.4.x
Component: Xapian-bindings Version: SVN trunk
Severity: minor Keywords:
Cc: richard Blocked By:
Blocking: Operating System: All

Description (last modified by olly)

Current status:

Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0 in Java when returned

Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when returned

C#: Truncates at \0 on input


Original description:

Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library are harmless at 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). xapian-bindings hasn't been checked.

Attachments (1)

java-use-string-critical-twice-as-slow-though.patch (2.1 KB) - added by olly 7 years ago.
Patch to do proper character set conversion for Java

Download all attachments as: .zip

Change History (23)

comment:1 Changed 14 years ago by olly

  • Severity changed from blocker to normal
  • Status changed from new to assigned

comment:2 Changed 14 years ago by olly

All c_str calls in xapian-core are ok at 2004-12-05

comment:3 Changed 14 years ago by olly

No calls to c_str in xapian-examples!

comment:4 Changed 12 years ago by olly

  • Component changed from other to Other
  • op_sys changed from other to All
  • rep_platform changed from Other to All
  • Version changed from other to SVN HEAD

Most of the bindings now have a simple zero-byte cleanliness test.

The notable exception is C# which doesn't seem to easily support zero byte clean string passing currently.

Tcl encodes zero bytes as the non-normalised utf-8 form "\xc0\x80" so it's not actually possible to pass a zero byte into Xapian from Tcl...

comment:5 Changed 12 years ago by olly

  • Component changed from Other to Xapian-bindings

comment:6 Changed 12 years ago by olly

  • Priority changed from high to normal
  • Severity changed from normal to minor
  • Summary changed from zero byte cleanliness to zero byte cleanliness in C# bindings

comment:7 Changed 12 years ago by richard

  • Blocking 118 added

This should at least be investigated for the 1.0 release, though maybe it's too hard to fix in all bindings right now.

comment:8 Changed 12 years ago by olly

  • Summary changed from zero byte cleanliness in C# bindings to zero byte cleanliness in C# and Java bindings

It's already fixed for everything except C# (oh, hmm and Java too) - there are tests in smoketest.* for it.

Tcl is fixed by virtue of not being able to have literal zero bytes in strings as they're encoded as non-normalised utf-8. Strictly speaking that should be addressed by checking for and translating '\xc0\x80' <-> '\0' as it's not valid utf-8 under current rules.

I asked on the SWIG list about C# ages ago and William Fulton seemed to suggest it was hard to fix for C# - the pinvoke stuff is built around C strings it seems.

Java is probably best fixed by transitioning to SWIG rather than expending effort on doomed code, but I'd rather leave moving it to SWIG until after 1.0. I don't think it's a huge amount of work, but it is a fundamental change that's potentially disruptive.

comment:9 Changed 12 years ago by richard

  • Cc richard@… added

comment:10 Changed 12 years ago by richard

Okay - since fixing this isn't going to break anything which already works, I think it's a prime candidate for postponing to after 1.0.

comment:11 Changed 12 years ago by olly

  • Blocking 118 removed

For Tcl, it would change what went in the database if you sent a "tcl nul", but if we ignore "\xc0\x80" in what comes out, it'll still essentially work OK (and I don't expect a lot of people put zero bytes in Xapian databases from Tcl anyway).

So I agree (removed block).

comment:12 Changed 12 years ago by richard

  • Blocking 120 added
  • Operating System set to All

comment:14 Changed 11 years ago by richard

  • Description modified (diff)
  • Milestone set to 1.1

comment:15 Changed 11 years ago by richard

  • Blocking 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:16 Changed 10 years ago by olly

As of 1.0.7, passing strings from Java to C++ is now zero byte safe.

comment:17 Changed 10 years ago by olly

  • Milestone changed from 1.1.0 to 1.1.1

These changes aren't incompatible, so bumping to 1.1.1.

comment:18 Changed 10 years ago by olly

  • Milestone changed from 1.1.1 to 1.1.7

Triaging milestone:1.1.1 bugs.

comment:19 Changed 9 years ago by olly

  • Milestone changed from 1.1.7 to 1.2.0

This shouldn't require incompatible changes. While it would be better to be zero-byte clean, we've not had any actual user complaints or comments, so I'm going to bump this out of 1.1.x.

comment:20 Changed 7 years ago by olly

  • Description modified (diff)

SWIG-based Java isn't zero byte clean, but looking at the JNI methods used, I don't think the hand-coded JNI actually was on input (despite my comment above about it being so in 1.0.7).

Updated description with current status.

comment:21 Changed 7 years ago by olly

  • Description modified (diff)

I must have had an unclean tree or something. SWIG-based Java bindings are just like Tcl - from the Java side they appear zero-byte clean, but actually in C++ we see \xc0\x80 for Java \0.

I tried a quick patch to use GetStringCritical?() and convert to UTF-8 ourselves using's Xapian's Unicode support, which would mean \0 in Java <-> \0 in C++, and also would convert surrogate pairs in Java's representation properly to/from UTF-8 in C++. However, timing some operations which do a lot of string passing this is twice as slow, so I'm parking it for now. I'll attach it here so it doesn't get lost.

Added a Java testcase in r15917, which just checks that the roundtripping works.

Last edited 2 years ago by olly (previous) (diff)

Changed 7 years ago by olly

Patch to do proper character set conversion for Java

comment:22 Changed 7 years ago by olly

  • Milestone changed from 1.2.x to 1.3.x

It would be good to look at these issues for 1.3.x, particularly seeing if the Java patch can be made to work faster.

comment:23 Changed 4 years ago by olly

  • Milestone changed from 1.3.x to 1.4.x

This isn't worth holding up 1.4.0 for.

Note: See TracTickets for help on using tickets.