Opened 16 years ago

Closed 15 years ago

Last modified 15 years ago

#106 closed defect (released)

Indexing characters outside the BMP can result in exception

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Omega
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

A document containing certain Unicode characters can cause omega's query parser to generate a zero-length term name, which makes Xapian throw an "Empty termnames aren't allowed." error.

The minimal example data file I've found is a document containing only the character U+28A0F (0x28a0f), which is a CJK Unified Ideograph. I have a fix, which I will commit shortly, that checks the term length isn't zero before adding the term to the document. It's possible, though, that we should instead fix this by generating a term containing the Unicode character, rather than just throwing it away.
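
The guard described above can be sketched as follows. This is only an illustration in Python, not the actual omega change: `add_terms` and the plain list standing in for a document are hypothetical names, and the real fix lives in omega's C++ indexing code.

```python
def add_terms(doc_terms, candidate_terms):
    """Append candidate terms to doc_terms, skipping zero-length ones.

    Hypothetical stand-in for the check described in the ticket: an empty
    term is never passed on, so the "Empty termnames aren't allowed."
    error cannot be triggered from here.
    """
    for term in candidate_terms:
        if len(term) == 0:
            # Silently drop the empty term instead of raising.
            continue
        doc_terms.append(term)
    return doc_terms


terms = add_terms([], ["hello", "", "world"])
```

Note that this only hides the symptom: the character that produced the empty term is still lost, which is the concern raised in the last sentence of the description.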

Attachments (1)

tmp2.dump (10 bytes) - added by Richard Boulton 16 years ago.
Minimal test data file


Change History (6)

by Richard Boulton, 16 years ago

Attachment: tmp2.dump added

Minimal test data file

comment:1 by Olly Betts, 15 years ago

Status: new → assigned

Actually it's not CJK, but rather a "Unified Han Ideograph" according to the Unicode charts. Note that this is outside the BMP, so it's not supported by the Unicode routines we're using (taken from Tcl).

However, we shouldn't ever generate a zero length term even when presented with characters outside the range we currently handle.

comment:2 by Olly Betts, 15 years ago

I've been mulling this over. I think the best answer for now is probably to assume any characters outside of the BMP are word characters. Eventually we ought to sort out Unicode routines which handle such characters fully, but I think it's more important to focus on getting 1.0 out with very good Unicode support than to aim for perfect Unicode support and delay the release.

What was the document you encountered this character in? Did it just have a few characters outside the BMP?

comment:3 by Richard Boulton, 15 years ago

Yes, IIRC the document I encountered this in had only this one character outside the BMP: I'm afraid I don't have the actual document to hand.

Treating it as a word character seems the right approach for now to me, too. In this particular example, it wouldn't have mattered particularly whether the character could be found - it just mattered that it threw an error.
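
A minimal sketch of the approach agreed on here, written in Python for illustration only. The names `is_wordchar` and `to_lower` are hypothetical, and Python's built-in string methods stand in for the Tcl-derived BMP tables, which are not reproduced in this ticket.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def is_wordchar(cp):
    """Return True if code point cp should be treated as part of a term."""
    if cp > BMP_MAX:
        # Assumption from the ticket: anything outside the BMP is a word character.
        return True
    # Stand-in for the real BMP classification tables.
    return chr(cp).isalnum()


def to_lower(cp):
    """Lowercase a BMP code point; leave code points outside the BMP unchanged."""
    if cp > BMP_MAX:
        return cp
    lowered = chr(cp).lower()
    # Some characters lowercase to more than one code point; keep those as-is.
    return ord(lowered) if len(lowered) == 1 else cp
```

With this, U+28A0F is kept as part of a term rather than being dropped, at the cost of never case-folding characters outside the BMP.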

comment:4 by Olly Betts, 15 years ago

Resolution: fixed
Status: assigned → closed
Summary: Indexing CJK characters can result in exception → Indexing characters outside the BMP can result in exception

OK, I've now committed a fix which assumes characters outside the BMP are word characters, but can't be forced to lowercase.

And looking at the check for zero length terms, that shouldn't be required because we should never be able to arrive at the check with an empty term. But we do, even with the fix above.

The real cause of this bug is that my UTF-8 code was decoding 4-byte sequences incorrectly, so when we try to convert a too-large value back to UTF-8, we get an empty string. I've committed a fix for this incorrect decoding, so this is fixed in SVN HEAD.
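
To make the failure mode concrete, here is a sketch in Python of a correct 4-byte decode, plus an encoder that returns an empty string for out-of-range values as described above. It is not the Xapian code: `decode_utf8_4byte` and `encode_utf8` are hypothetical names, and the details of the original decoding bug are not shown because the ticket doesn't give them.

```python
def decode_utf8_4byte(seq):
    """Decode one 4-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a 4-byte UTF-8 lead byte")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    cp = ((seq[0] & 0x07) << 18) | ((seq[1] & 0x3F) << 12) \
        | ((seq[2] & 0x3F) << 6) | (seq[3] & 0x3F)
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("overlong or out-of-range value")
    return cp


def encode_utf8(cp):
    """Encode a code point as UTF-8; return b"" if it is out of range.

    Surrogates (U+D800..U+DFFF) are not special-cased in this sketch.
    """
    if cp < 0 or cp > 0x10FFFF:
        return b""  # mirrors the empty-string symptom described in the ticket
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


raw = encode_utf8(0x28A0F)       # the character from the test file
roundtrip = decode_utf8_4byte(raw)
```

A decoder that gets the shifts or masks wrong can produce a value above 0x10FFFF, and re-encoding that value is what yields the empty term.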

comment:5 by Olly Betts, 15 years ago

Operating System: All
Resolution: fixed → released

Fixed in 1.0.0 release, which now knows about characters outside the BMP.
