Opened 16 years ago
Closed 16 years ago
#355 closed defect (fixed)
non-spacing chars are not term splitters
Reported by: | Muayyad Alsadi | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.0.12 |
Component: | QueryParser | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
I was evaluating Xapian for indexing Arabic documents and noticed that terms get chopped off. The reason is that characters such as U+0651 ARABIC SHADDA (a stress marker), which are in the Unicode category "Mark, Non-Spacing", are not treated by is_wordchar as part of a word, so the word is split at that point.
The patch is trivial.
Thanks to Olly Betts (IRC: ojwb) for helping me with it.
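To illustrate the splitting described above, here is a minimal Python sketch. It is not Xapian's code: the `is_wordchar` and `split_terms` functions below are hypothetical stand-ins, and the set of categories treated as word characters (letters, numbers, connector punctuation) is an assumption made for the example. The `include_marks` flag models the effect of also accepting "Mark, Non-Spacing" (`Mn`) characters.

```python
import unicodedata


def is_wordchar(ch, include_marks):
    """Hypothetical word-character test based on the Unicode general category."""
    cat = unicodedata.category(ch)
    if cat[0] in ("L", "N"):  # letters and numbers
        return True
    if cat == "Pc":  # connector punctuation such as "_"
        return True
    # Only count non-spacing marks as word characters when asked to.
    return include_marks and cat == "Mn"


def split_terms(text, include_marks):
    """Split text into terms at every character that is not a word character."""
    terms, current = [], []
    for ch in text:
        if is_wordchar(ch, include_marks):
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


SHADDA = "\u0651"  # ARABIC SHADDA, general category Mn
# MEEM, HAH, MEEM, SHADDA, DAL: one word containing a non-spacing mark.
word = "\u0645\u062d\u0645" + SHADDA + "\u062f"

without_marks = split_terms(word, include_marks=False)  # word is split at the SHADDA
with_marks = split_terms(word, include_marks=True)      # word stays in one piece
```

With `include_marks=False` the single word comes back as two terms; with `include_marks=True` it stays whole.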
Attachments (1)
Change History (4)
by , 16 years ago
Attachment: | xapian-core-non-spacing.patch added |
---|
comment:1 by , 16 years ago
Component: | Other → QueryParser |
---|---|
Milestone: | → 1.1.0 |
Status: | new → assigned |
Version: | → SVN trunk |
We should fix this for 1.1.0, since indexes built with and without the change will have incompatible terms, at least for anyone indexing or searching data that contains such characters.
Issues:
- This patch is simple and works pretty well, but ideally a space followed by a non-spacing mark shouldn't count as a space. In reality, we can probably ignore this for now - this approach is a definite improvement over the current handling.
- We should really be putting decomposed and decomposable characters into some canonical form so that the representation doesn't matter for matching. But we've already punted on that for this release series.
- We can't really make this change for 1.0.x, but we could make non-spacing marks phrase generators in the QueryParser so that <first part of word><SHADDA><second part of word> becomes a phrase search for "<first part of word> <second part of word>". That will work with existing databases, though it's not as good as the solution in this patch - e.g. all non-spacing marks will be equivalent.
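The phrase-generator idea in the last point can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the QueryParser implementation: `parse_word` is a hypothetical helper, and the `("phrase", [...])` / `("term", ...)` tuples are just a stand-in for whatever query structure the parser would build.

```python
import unicodedata


def parse_word(token):
    """Treat each non-spacing mark (category Mn) as a phrase generator.

    Returns ("phrase", [parts]) when the token contained a mark between
    word parts, otherwise ("term", token_without_marks).
    """
    parts, current = [], []
    for ch in token:
        if unicodedata.category(ch) == "Mn":
            # The mark itself is dropped; it only separates the parts.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    if len(parts) > 1:
        return ("phrase", parts)
    return ("term", parts[0] if parts else "")


first, second = "\u0645\u062d\u0645", "\u062f"
with_shadda = parse_word(first + "\u0651" + second)  # U+0651 ARABIC SHADDA
with_fatha = parse_word(first + "\u064e" + second)   # U+064E ARABIC FATHA
```

Both calls produce the same phrase, which shows the limitation noted above: under this approach all non-spacing marks are equivalent, because the mark never becomes part of any term.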
comment:2 by , 16 years ago
Milestone: | 1.1.0 → 1.0.12 |
---|
Fixed as suggested in trunk r12344.
We should address this in 1.0.12 using the phrase-generator approach.
comment:3 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Fixed for 1.0.12 in r12346.
patch to fix non-spacing chars