Ticket #180 (assigned enhancement)

Opened 3 years ago

Last modified 11 months ago

Add support for CJK text to queryparser and termgenerator

Reported by: richard
Owned by: richard
Priority: high
Milestone: 1.2.x
Component: QueryParser
Version: SVN trunk
Severity: normal
Keywords:
Cc: xaka2004@…
Blocked By:
Operating System: All
Blocking:

Description (last modified by olly)

Some code to do this kind of tokenisation is now available at http://code.google.com/p/cjk-tokenizer/, which should probably be used as the basis for supporting this in Xapian.

We could add this as a QueryParser/TermGenerator option without breaking API compatibility. Marking for consideration later in 1.1.x, but it could probably go in 1.2.x as it's likely to be ABI compatible too.

Attachments

cjkv.patch (36.7 KB) - added by xaka 12 months ago.
Patch to add CJKV tokenizer support

Change History

Changed 3 years ago by richard

  • status changed from new to assigned

Changed 3 years ago by trac

  • platform set to All

Changed 17 months ago by olly

  • type changed from defect to enhancement
  • description modified
  • milestone set to 1.1.7

Fabrice Colin said on xapian-discuss:

Pinot uses a slightly modified version of Yung-Chung Lin's cjk-tokenizer that can be found at http://svn.berlios.de/wsvn/dijon/trunk/cjkv/CJKVTokenizer.cc. For an example, see the XapianIndex and TokensIndexer classes at http://svn.berlios.de/wsvn/pinot/trunk/IndexSearch/Xapian/XapianIndex.cpp

Changed 14 months ago by olly

  • priority changed from normal to high

This can probably be added without incompatible changes, but it would be good to have done.

Changed 13 months ago by olly

Should be possible to do in an API and ABI compatible way, so bumping to stay on track.

Changed 13 months ago by olly

  • milestone changed from 1.1.7 to 1.2.0

Actually update the milestone...

Changed 12 months ago by xaka

  • cc xaka2004@… added

Hi everyone! About a month ago the company I work for needed a CJKV indexer to improve its search mechanism. As the backend we use Xapian and the omega indexer.

I've attached the result of my work integrating Dijon's CJKVTokenizer into the latest stable Xapian source tree (1.0.16). All tests pass and the tokenizer works really well.

What I've done:

* added the m4/pkg.m4 file so that pkg-config can be used to determine the right CFLAGS and LIBS

* with my patch Xapian depends on glib2, which the CJKV tokenizer uses to work with Unicode/UTF-8

* added a check for glib2 at configure time

* extended the LIBS and CFLAGS of xapian-config with those of glib2

* added include/xapian/cjkv/CJKVTokenizer.h from Dijon (I've kept the Dijon namespace) without any changes

* added queryparser/CJKVTokenizer.cc from Dijon without any changes

* added a modified QueryModifier, which is used to modify the input query (a bigram model splits each CJKV sequence into tokens; the other parts of the query are unchanged). The modifier is applied when parse_query is called

* added a modified Indexer, which is used in TermGenerator (a bigram model splits each CJKV sequence into terms)
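
For readers unfamiliar with the bigram model mentioned in the last two items, here is a minimal sketch in Python. It is not taken from the patch or from Dijon's CJKVTokenizer; the function name `cjk_bigrams` is hypothetical, and whether the real tokenizer also emits single-character tokens is not something this sketch tries to reproduce.

```python
def cjk_bigrams(run):
    """Split a run of CJK characters into overlapping two-character tokens.

    Hypothetical helper for illustration only; not the patch's code.
    """
    if len(run) < 2:
        # A lone character has no neighbour to pair with, so keep it as is.
        return [run] if run else []
    # Slide a window of width 2 over the run, one character at a time.
    return [run[i:i + 2] for i in range(len(run) - 1)]
```

With this sketch, `cjk_bigrams("中文搜索")` gives `["中文", "文搜", "搜索"]`. Applying the same split at index time and at query time is what lets the query terms line up with the indexed terms.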

To build Xapian you need to:

* run "aclocal" to regenerate aclocal.m4 and include the added pkg.m4

* run "autoconf"

* run "automake"

* make sure that you have glib2 installed

* run "make"

I've modified two parts of Xapian: QueryParser::Internal::parse_query and TermGenerator::Internal::index_text. As a result, you just need to rebuild xapian-core and xapian-omega and you'll get CJKV support.

Changed 12 months ago by olly

  • description modified

Thanks for the patch - certainly a step forward.

There seem to be quite a lot of whitespace changes which make it harder to read. Can you regenerate it adding -bB to the diff options?

The new header shouldn't be under "include/xapian", since that's for the installed public API headers, but that's easy enough to fix.

It would be better to use Xapian's Unicode and UTF-8 support rather than adding a dependency on glib. Not just because adding avoidable dependencies is generally better, but also because there's scope for getting confused results if glib and Xapian's routines give different answers (as they might legitimately do if they are supporting different Unicode versions, or if invalid UTF-8 sequences are encountered).

I think it's probably better to have the user select "CJKV-mode". Exploding every string being indexed into a vector and then scanning it to see if CJKV characters are present is going to add a lot of overhead to everyone, even those indexing non-CJKV text. It also seems we don't want to completely change how we index (e.g.) English text with a Chinese name in it. Alternatively, we could perhaps switch mode within a text string when we hit CJKV, and switch back when we hit non-CJKV.

Changed 12 months ago by xaka

Patch to add CJKV tokenizer support

Changed 12 months ago by xaka

Updated patch attached.

1. Where should I put the cjkv header/source files?

2. Yes, the glib2 dependency isn't good because Xapian already has a Unicode/UTF-8 API. I agree, but I have no time at the moment to completely rework the cjkv code, so I've integrated Dijon's code "as is". One thing: the Dijon/glib2 code will only be used if a document has CJKV sequences, i.e. it's 99% backward compatible for non-CJKV documents :).

3. How and where should the user select CJKV-mode? What if the user just has a big folder with many files which are updated every day, and every day this big folder is indexed? Or another example: international forums. There is no way to say "index this file/topic with CJKV-mode". We can try to optimize the process of scanning for and detecting CJKV sequences.

4. About your alternative: it's already done in the patch (if I understand you correctly). If the string being indexed doesn't contain CJKV, the old algorithm will be used.

Put simply: "No CJKV - the patch will not be used and everything stays as is. If there is CJKV - we will use the modified queryparser/termgenerator code".

Let's continue discussing everything; I think I can help to complete the CJKV integration. The major work is done. Minor work remains...

Changed 11 months ago by olly

Sorry for not responding sooner, I'm insanely busy this month.

> 1. Where should I put the cjkv header/source files?

I'd suggest sticking the cjkv support in its own "cjkv" subdirectory, since it's essentially its own subsystem. Certainly "include" is only for headers visible to the end user, so that's not suitable.

> 2. Yes, the glib2 dependency isn't good because Xapian already has a Unicode/UTF-8 API. I agree, but I have no time at the moment to completely rework the cjkv code, so I've integrated Dijon's code "as is". One thing: the Dijon/glib2 code will only be used if a document has CJKV sequences, i.e. it's 99% backward compatible for non-CJKV documents :).

I think this needs to be done before we can put this patch in a release, though I can probably sort it out when I'm less busy.

> 3. How and where should the user select CJKV-mode? What if the user just has a big folder with many files which are updated every day, and every day this big folder is indexed? Or another example: international forums. There is no way to say "index this file/topic with CJKV-mode". We can try to optimize the process of scanning for and detecting CJKV sequences.

In many cases the user knows they are handling particular languages, and then checking for CJKV is a waste of time. Conversely, you may only be handling CJKV, in which case checking is also pointless.

But in the "might be CJKV or might not" case, we certainly could be more efficient than converting the whole string to a vector and then scanning that. Xapian::Utf8Iterator would be a better approach.
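
To make the "scan lazily and stop early" idea concrete, here is a sketch in Python rather than with Xapian::Utf8Iterator itself. It decodes the UTF-8 input incrementally and returns as soon as one CJKV character is seen, without first building a list of all code points. The ranges in `CJKV_RANGES` are an illustrative subset of Unicode blocks chosen for this example, not the set used by the patch, and the function names are hypothetical.

```python
import codecs

# Illustrative subset of CJKV Unicode blocks; NOT the ranges used by the patch.
CJKV_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjkv(ch):
    """Return True if the single character ch falls in one of CJKV_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJKV_RANGES)


def contains_cjkv(data):
    """Return True as soon as a CJKV character is decoded from UTF-8 bytes."""
    # Feed the decoder one byte at a time so decoding stays lazy.
    one_byte_chunks = (data[i:i + 1] for i in range(len(data)))
    pieces = codecs.iterdecode(one_byte_chunks, "utf-8", errors="replace")
    # any() stops consuming the generator at the first match.
    return any(is_cjkv(ch) for piece in pieces for ch in piece)
```

For text that contains CJKV near the start this does very little work; for text with no CJKV at all it still has to walk the whole string, which is why letting the user state the mode up front remains worthwhile.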

> 4. About your alternative: it's already done in the patch (if I understand you correctly). If the string being indexed doesn't contain CJKV, the old algorithm will be used.

I'm thinking of the case of a mixed document (a document without any CJKV characters is obviously easy to deal with, and similarly a document which is only CJKV is easy too).

I'm suggesting (perhaps) that if a document is in (say) English with quoted Chinese text, the English parts will be indexed as they currently are while the Chinese parts would be indexed with the CJKV rules, with the tokenizer switching between CJKV-mode and non-CJKV-mode as it goes. That avoids the need to decide whether such documents are "CJKV" or "non-CJKV", so there's no need to pre-scan them prior to actually indexing them.
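
A rough sketch of that mode-switching idea, in Python for brevity. It makes a single pass over the text, collects runs of characters, and flushes a run whenever the mode changes: non-CJKV runs become whole-word terms (lowercasing stands in here for the existing processing) and CJKV runs become overlapping bigrams. The ranges and all function names are assumptions made for illustration, not code from the patch or from Xapian.

```python
# Illustrative subset of CJKV Unicode blocks; NOT the ranges used by the patch.
CJKV_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjkv(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJKV_RANGES)


def flush(run, run_is_cjkv, tokens):
    """Append the terms for one completed run to tokens."""
    if not run:
        return
    if not run_is_cjkv:
        tokens.append(run.lower())  # stand-in for the existing word handling
    elif len(run) == 1:
        tokens.append(run)          # lone CJKV character, nothing to pair with
    else:
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))


def tokenize_mixed(text):
    """Tokenize text in one pass, switching mode at CJKV boundaries."""
    tokens = []
    run = ""
    run_is_cjkv = False
    for ch in text:
        if is_cjkv(ch):
            ch_is_cjkv = True
        elif ch.isalnum():
            ch_is_cjkv = False
        else:
            # Whitespace or punctuation ends the current run in either mode.
            flush(run, run_is_cjkv, tokens)
            run = ""
            continue
        if run and ch_is_cjkv != run_is_cjkv:
            # Mode switch in the middle of a string: close the previous run.
            flush(run, run_is_cjkv, tokens)
            run = ""
        run_is_cjkv = ch_is_cjkv
        run += ch
    flush(run, run_is_cjkv, tokens)
    return tokens
```

For example, `tokenize_mixed('He said "中文搜索" twice')` gives `["he", "said", "中文", "文搜", "搜索", "twice"]`. No pre-scan is needed, and the English parts come out the same as they would without any CJKV in the document.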

> Put simply: "No CJKV - the patch will not be used and everything stays as is. If there is CJKV - we will use the modified queryparser/termgenerator code".

I think Xapian should have some sort of CJKV support, and this patch is a good start, but I do think it needs further work.

There's also the issue of the licence. Xapian is currently GPL, but we'd like to get to a position where we can relicense in the future. LGPL is a possible choice for the new licence, though we might want to go to a more liberal licence than that. I suspect this isn't a blocker, but we'd need to check with Fabrice.
