Opened 4 years ago
Closed 4 years ago
#809 closed defect (invalid)
indexing CJK documents seems quite off
Reported by: | jay sun | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Other | Version: | 1.4.18 |
Severity: | normal | Keywords: | |
Cc: | jay sun | Blocked By: | |
Blocking: | | Operating System: | Linux |
Description
Hi
I have a directory of docx files, and all contents are in Chinese.
I ran recoll to search for a particular string and got only 5 matches. I made sure ckjoff=0 and cjkgramlen=3.
I ran docfetcher to search for the same string in the same set of Chinese documents and found 60+ matches.
Therefore there must be some issue with recoll/xapian indexing of those documents.
Thanks, Jay
Recoll doesn't use Xapian's CJK support - it uses its own code instead:
https://framagit.org/medoc92/recoll/-/blob/master/src/common/textsplit.cpp
I think this is because Xapian's CJK support was added after recoll's was.
Anyway, this isn't a bug in Xapian so you'll need to report this to the recoll developers.
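For context on the n-gram length setting mentioned in the description, here is a minimal sketch of overlapping character n-gram splitting, the general technique commonly used to index CJK text that has no spaces between words. It is an illustration only and is not taken from Recoll's or Xapian's source; the function name `cjk_ngrams` and the parameter `max_len` are hypothetical, and the real splitters may differ in which n-gram lengths they emit and how they handle mixed scripts.

```python
def cjk_ngrams(text, max_len=3):
    """Return overlapping character n-grams of length 1..max_len.

    Hypothetical helper for illustration; not Recoll's or Xapian's code.
    """
    grams = []
    for start in range(len(text)):
        for n in range(1, max_len + 1):
            # Only emit n-grams that fit entirely inside the text.
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


# With max_len=2, each character yields a unigram and (where possible) a bigram.
print(cjk_ngrams("中文文档", max_len=2))
```

Under a scheme like this, a query string is split the same way and matched against the indexed n-grams, so differences in splitting between two tools can plausibly lead to different match counts for the same documents.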