#106 closed defect (released)
Indexing characters outside the BMP can result in exception
| Reported by: | Richard Boulton | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | Omega | Version: | SVN trunk |
| Severity: | normal | Keywords: | |
| Cc: | | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
A document containing certain Unicode characters can result in a zero-length term name being generated by omega's query parser, which causes Xapian to throw an "Empty termnames aren't allowed." error.
The minimal example data file I've found is simply a document containing only the character 0x28a0f, which is a CJK Unified Ideograph. I have a fix, which I will commit shortly; it simply checks that the term length isn't zero before adding the term to the document. It's possible, though, that we should be fixing this by generating a term containing the Unicode character instead of just throwing it away.
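The guard described above could look roughly like the following. This is only a sketch: `TermList` and `add_term_checked` are hypothetical names standing in for the document and the indexing code, not Xapian's or Omega's actual API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a document's list of terms (not Xapian::Document).
struct TermList {
    std::vector<std::string> terms;
};

// Add `term` only when it is non-empty, so an empty term never reaches the
// backend. Returns true if the term was added, false if it was skipped.
bool add_term_checked(TermList& doc, const std::string& term) {
    if (term.empty()) {
        return false;  // skip instead of triggering the empty-term error
    }
    doc.terms.push_back(term);
    return true;
}
```

A guard like this hides the symptom only; it silently drops whatever input produced the empty term.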
Attachments (1)
Change History (6)
comment:1 by , 18 years ago
Status: | new → assigned |
---|
Actually it's not CJK, but rather a "Unified Han Ideograph" according to the Unicode charts. Note that this is outside the BMP, so it's not supported by the Unicode routines we're using (taken from Tcl).
However, we shouldn't ever generate a zero-length term, even when presented with characters outside the range we currently handle.
comment:2 by , 18 years ago
I've been mulling this over. I think the best answer for now is probably to assume any characters outside of the BMP are word characters. Eventually we ought to sort out Unicode routines which handle such characters fully, but I think it's more important to focus on getting 1.0 out with very good Unicode support than to aim for perfect Unicode support and delay the release.
What was the document you encountered this character in? Did it just have a few characters outside the BMP?
comment:3 by , 18 years ago
Yes, IIRC the document I encountered this in had only this one character outside the BMP: I'm afraid I don't have the actual document to hand.
Treating it as a word character seems the right approach for now to me, too. In this particular example, it wouldn't have mattered particularly whether the character could be found - it just mattered that it threw an error.
comment:4 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Summary: | Indexing CJK characters can result in exception → Indexing characters outside the BMP can result in exception |
OK, I've now committed a fix which assumes characters outside the BMP are word characters, but can't be forced to lowercase.
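As an illustration of that rule, a minimal sketch might look like this. The function names are hypothetical, and the ASCII-only handling inside the BMP is a placeholder assumption standing in for the real Unicode tables; none of this is the committed code.

```cpp
#include <cctype>

// True for code points above U+FFFF, i.e. outside the Basic Multilingual Plane.
inline bool outside_bmp(unsigned ch) {
    return ch > 0xFFFFu;
}

// Hypothetical classifier: anything outside the BMP counts as a word character.
bool is_wordchar(unsigned ch) {
    if (outside_bmp(ch)) {
        return true;
    }
    // Placeholder: ASCII alphanumerics only, in place of real Unicode tables.
    return ch < 0x80u && std::isalnum(static_cast<int>(ch)) != 0;
}

// Hypothetical case folding: characters outside the BMP are returned unchanged.
unsigned to_lower(unsigned ch) {
    if (outside_bmp(ch)) {
        return ch;  // no case mapping is attempted for these
    }
    if (ch < 0x80u) {
        return static_cast<unsigned>(std::tolower(static_cast<int>(ch)));
    }
    return ch;  // placeholder: non-ASCII BMP characters left as-is
}
```

With this rule, `is_wordchar(0x28A0F)` is true and `to_lower(0x28A0F)` returns the character unchanged, so the character survives into the term.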
And looking at the check for zero length terms, that shouldn't be required because we should never be able to arrive at the check with an empty term. But we do, even with the fix above.
The real cause of this bug is that my UTF-8 code was decoding 4-byte sequences incorrectly, so when we try to convert a too-large value back to UTF-8, we get an empty string. I've committed a fix for this incorrect decoding, so this is fixed in SVN HEAD.
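The ticket does not show the faulty code, so the following is only a sketch of what correct handling of a 4-byte sequence looks like, using the character from this report. `decode_utf8_4` and `encode_utf8` are hypothetical names. The decoder assumes four well-formed bytes, and the encoder does not reject surrogates. The encoder returns an empty string for values above U+10FFFF, mirroring the symptom described above.

```cpp
#include <string>

// Decode one 4-byte UTF-8 sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
// Assumes p points at four well-formed bytes; no validation is done here.
unsigned decode_utf8_4(const unsigned char* p) {
    return ((p[0] & 0x07u) << 18) |
           ((p[1] & 0x3Fu) << 12) |
           ((p[2] & 0x3Fu) << 6) |
           (p[3] & 0x3Fu);
}

// Encode a code point as UTF-8. Values above U+10FFFF cannot be encoded,
// so the result is an empty string in that case.
std::string encode_utf8(unsigned ch) {
    std::string out;
    if (ch < 0x80u) {
        out += static_cast<char>(ch);
    } else if (ch < 0x800u) {
        out += static_cast<char>(0xC0u | (ch >> 6));
        out += static_cast<char>(0x80u | (ch & 0x3Fu));
    } else if (ch < 0x10000u) {
        out += static_cast<char>(0xE0u | (ch >> 12));
        out += static_cast<char>(0x80u | ((ch >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (ch & 0x3Fu));
    } else if (ch < 0x110000u) {
        out += static_cast<char>(0xF0u | (ch >> 18));
        out += static_cast<char>(0x80u | ((ch >> 12) & 0x3Fu));
        out += static_cast<char>(0x80u | ((ch >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (ch & 0x3Fu));
    }
    return out;
}
```

For 0x28a0f the UTF-8 bytes are F0 A8 A8 8F. Decoding them with the masks above gives 0x28a0f back, and re-encoding that value produces the same four bytes rather than an empty string.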
comment:5 by , 18 years ago
Operating System: | → All |
---|---|
Resolution: | fixed → released |
Fixed in 1.0.0 release, which now knows about characters outside the BMP.
Minimal test data file