Opened 16 years ago

Closed 16 years ago

#355 closed defect (fixed)

non-spacing chars are not term splitters

Reported by: Muayyad Alsadi
Owned by: Olly Betts
Priority: normal
Milestone: 1.0.12
Component: QueryParser
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

I was evaluating Xapian for indexing Arabic documents and noticed that terms are chopped off. The reason is that characters such as U+0651 ARABIC SHADDA (a stress marker), which is in the Unicode category "Mark, Non-Spacing", are not treated by is_wordchar as part of a word, so the word is split at that point.

The patch is trivial.

Thanks to Olly Betts (IRC: ojwb) for helping me with it.
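
To illustrate the problem described above, here is a minimal Python sketch. It is not Xapian's code: the is_wordchar and split_terms functions below are hypothetical stand-ins that assume only letters and numbers count as word characters, with an optional switch to also accept the "Mark, Non-Spacing" (Mn) category.

```python
import unicodedata


def is_wordchar(ch, include_marks):
    """Hypothetical word-character test (an assumption, not Xapian's is_wordchar)."""
    cat = unicodedata.category(ch)
    if cat[0] in ("L", "N"):  # letters and numbers
        return True
    # "Mn" is the Unicode category "Mark, Non-Spacing", e.g. U+0651 ARABIC SHADDA.
    return include_marks and cat == "Mn"


def split_terms(text, include_marks):
    """Split text into terms at every character that is not a word character."""
    terms, current = [], []
    for ch in text:
        if is_wordchar(ch, include_marks):
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


# An Arabic word containing U+0651 ARABIC SHADDA as its fourth code point.
word = "\u0645\u062d\u0645\u0651\u062f"

chopped = split_terms(word, include_marks=False)  # word is split at the SHADDA
whole = split_terms(word, include_marks=True)     # word stays in one piece
```

With include_marks=False the word comes back as two terms, split where the SHADDA was; with include_marks=True it comes back as a single term.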

Attachments (1)

xapian-core-non-spacing.patch (495 bytes) - added by Muayyad Alsadi 16 years ago.
patch to fix non-spacing chars


Change History (4)

by Muayyad Alsadi, 16 years ago

patch to fix non-spacing chars

comment:1 by Olly Betts, 16 years ago

Component: Other → QueryParser
Milestone: 1.1.0
Status: new → assigned
Version: SVN trunk

We should fix this for 1.1.0, since the change makes the terms in indexes built with and without it incompatible, at least for anyone indexing or searching data that contains such characters.

Issues:

  • This patch is simple and works pretty well, but ideally a space followed by a non-spacing mark shouldn't count as a space. In reality, we can probably ignore this for now - this approach is a definite improvement over the current handling.
  • We should really be putting decomposed and decomposable characters into some canonical form so that the representation doesn't matter for matching. But we've already punted on that for this release series.
  • We can't really make this change for 1.0.x, but we could make non-spacing marks phrase generators in the QueryParser so that <first part of word><SHADDA><second part of word> becomes a phrase search for "<first part of word> <second part of word>". That will work with existing databases, though it's not as good as the solution in this patch - e.g. all non-spacing marks will be equivalent.
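
The phrase-generator idea in the last point can be sketched roughly as follows. This is a simplified Python illustration, not the QueryParser implementation: parse_query and its helpers are hypothetical names, and the sketch assumes that only letters and numbers are word characters and that any character other than a non-spacing mark ends the current phrase.

```python
import unicodedata


def is_letter_or_digit(ch):
    """Assumed word-character test: letters and numbers only."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_nonspacing_mark(ch):
    """True for characters in the Unicode category "Mark, Non-Spacing"."""
    return unicodedata.category(ch) == "Mn"


def parse_query(text):
    """Return a list of phrases, each a list of terms (hypothetical sketch).

    A non-spacing mark ends the current term but keeps the next term in the
    same phrase. Any other non-word character ends the phrase as well.
    """
    phrases, phrase, term = [], [], []
    for ch in text:
        if is_letter_or_digit(ch):
            term.append(ch)
            continue
        if term:
            phrase.append("".join(term))
            term = []
        if not is_nonspacing_mark(ch) and phrase:
            phrases.append(phrase)
            phrase = []
    if term:
        phrase.append("".join(term))
    if phrase:
        phrases.append(phrase)
    return phrases


with_shadda = "\u0645\u062d\u0645\u0651\u062f"  # contains U+0651 ARABIC SHADDA
with_fatha = "\u0645\u062d\u0645\u064e\u062f"   # same letters, U+064E ARABIC FATHA
query = with_shadda + " \u0643\u062a\u0627\u0628"

result = parse_query(query)
```

Here the first word becomes a two-term phrase made of the parts on either side of the SHADDA, which can match databases whose terms were already split there. The sketch also shows the drawback noted above: with_shadda and with_fatha produce the same phrase, because the mark itself is discarded.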

comment:2 by Olly Betts, 16 years ago

Milestone: 1.1.0 → 1.0.12

Fixed as suggested in trunk r12344.

We should address this in 1.0.12 using the phrase-generator approach.

comment:3 by Olly Betts, 16 years ago

Resolution: fixed
Status: assigned → closed

Fixed for 1.0.12 in r12346.
