Opened 5 years ago
Closed 5 years ago
#809 closed defect (invalid)
indexing CJK documents seems quite off
| Reported by: | jay sun | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | Other | Version: | 1.4.18 |
| Severity: | normal | Keywords: | |
| Cc: | jay sun | Blocked By: | |
| Blocking: | | Operating System: | Linux |
Description
Hi
I have a directory of docx files whose contents are all in Chinese.
I ran recoll to search for a particular string and got only 5 matches. I made sure ckjoff=0 and cjkgramlen=3.
I then ran docfetcher to search for the same string in the same set of Chinese documents and found 60+ matches.
This suggests there is an issue with how recoll/xapian indexes those documents.
Thanks, Jay

Recoll doesn't use Xapian's CJK support - it uses its own code instead:
https://framagit.org/medoc92/recoll/-/blob/master/src/common/textsplit.cpp
I think this is because Xapian's CJK support was added after recoll's was.
Anyway, this isn't a bug in Xapian so you'll need to report this to the recoll developers.
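
For background, here is a minimal sketch of generic overlapping character n-gram splitting, the general technique that an n-gram length setting such as the one mentioned in the description refers to. This is not recoll's or Xapian's actual code; the function name `cjk_ngrams` and its behaviour (emitting only n-grams of exactly length `n`) are assumptions made purely for illustration.

```python
def cjk_ngrams(text: str, n: int = 3) -> list[str]:
    """Split text into overlapping character n-grams of exactly length n.

    Hypothetical helper for illustration only; a real indexer may also emit
    shorter n-grams, handle punctuation, or mix in non-CJK word splitting.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of n characters across the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Each term shares n-1 characters with its neighbour.
    print(cjk_ngrams("中文文档索引", 3))
    # A string shorter than n produces no terms under this simple scheme.
    print(cjk_ngrams("中文", 3))
```

Under a scheme like this, the terms stored in the index depend on the n-gram length and on how the splitter treats each run of characters, so two tools that split text differently can report different match counts for the same query. Whether that explains the difference seen here is something the recoll developers would need to confirm.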