Opened 3 years ago

Closed 3 years ago

#809 closed defect (invalid)

indexing CJK documents seems quite off

Reported by: jay sun
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Other
Version: 1.4.18
Severity: normal
Keywords:
Cc: jay sun
Blocked By:
Blocking:
Operating System: Linux

Description

Hi

I have a directory of docx files whose contents are all in Chinese.

I ran recoll to search for a particular string and got only 5 matches. I made sure ckjoff=0 and cjkgramlen=3.

I then ran docfetcher to search for the same string in the same set of Chinese documents, and it found 60+ matches.

Therefore there must be some issue with recoll/xapian indexing of those documents.

Thanks, Jay

Change History (1)

comment:1 by Olly Betts, 3 years ago

Resolution: invalid
Status: new → closed

Recoll doesn't use Xapian's CJK support - it uses its own code instead:

https://framagit.org/medoc92/recoll/-/blob/master/src/common/textsplit.cpp

I think this is because Xapian's CJK support was added after Recoll's was.

Anyway, this isn't a bug in Xapian so you'll need to report this to the recoll developers.
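
For background on what an n-gram length setting such as the cjkgramlen=3 mentioned in the report generally refers to, here is a minimal sketch of overlapping character n-gram splitting. It is an illustration only: it is not the code in Recoll's textsplit.cpp and not Xapian's CJK support, and the function name cjk_ngrams is hypothetical.

```python
def cjk_ngrams(text, n=3):
    """Split text into overlapping character n-grams (illustrative only).

    Hypothetical helper, not taken from Recoll or Xapian.
    Whitespace is dropped before splitting.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return []
    if len(chars) < n:
        # Input shorter than n yields a single short gram in this sketch;
        # a real indexer may handle this case differently.
        return ["".join(chars)]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# A six-character string yields four overlapping trigrams.
for gram in cjk_ngrams("中文文档检索", 3):
    print(gram)
```

Tools that segment CJK text differently index different terms, so the same query string can plausibly give different match counts from one tool to another.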
