Indexing characters outside the BMP can result in exception
|Reported by:||Richard Boulton||Owned by:||Olly Betts|
A document containing certain Unicode characters can result in a zero-length term name being generated by omega's query parser, which causes Xapian to throw an "Empty termnames aren't allowed." error.
The minimal example data file I've found is a document containing only the character U+28A0F, which is a CJK Unified Ideograph. I have a fix, which I will commit shortly, that checks that the term length isn't zero before adding the term to the document. However, it's possible that we should instead fix this by generating a term containing the Unicode character, rather than just throwing it away.
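As a rough illustration of the two options above, here is a minimal Python sketch. It does not use Xapian or omega code: `make_term` and `index_terms` are hypothetical stand-ins, and the assumption that the term generator drops characters above U+FFFF is made only to reproduce the empty-term symptom, not to describe the actual cause.

```python
def make_term(text, keep_non_bmp=False):
    """Hypothetical stand-in for a term generator.

    Assumption for illustration only: characters outside the BMP
    (code points above U+FFFF) are dropped unless keep_non_bmp is set.
    """
    kept = []
    for ch in text:
        if ord(ch) > 0xFFFF and not keep_non_bmp:
            continue  # non-BMP character is thrown away
        kept.append(ch)
    # Terms are represented here as UTF-8 byte strings.
    return "".join(kept).encode("utf-8")


def index_terms(texts, keep_non_bmp=False):
    """Collect terms for a document, skipping any that come out empty."""
    terms = []
    for text in texts:
        term = make_term(text, keep_non_bmp)
        if len(term) == 0:
            continue  # guard: never add a zero-length term
        terms.append(term)
    return terms


if __name__ == "__main__":
    doc = ["\U00028A0F"]  # the single character from the minimal example
    print(index_terms(doc))                     # guard only: character is lost
    print(index_terms(doc, keep_non_bmp=True))  # alternative: keep it as a term
```

With the guard alone, the document ends up with no terms at all, so it can never be found by searching for that character. Keeping the character yields a four-byte UTF-8 term instead.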
Change History (6)
comment:4 by , 15 years ago
|Status:||assigned → closed|
|Summary:||Indexing CJK characters can result in exception → Indexing characters outside the BMP can result in exception|