Opened 18 years ago

Closed 9 years ago

Last modified 9 years ago

#180 closed enhancement (fixed)

Add support for CJK text to queryparser and termgenerator

Reported by: Richard Boulton Owned by: Richard Boulton
Priority: normal Milestone: 1.2.22
Component: QueryParser Version: SVN trunk
Severity: normal Keywords:
Cc: xaka2004@… Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Some code to do this kind of tokenisation is now available at http://code.google.com/p/cjk-tokenizer/ which should probably be used as the basis for supporting this in Xapian.

We could add this as a QueryParser/TermGenerator option without breaking API compatibility. Marking for considering later in 1.1.x, but it could probably go in 1.2.x as it's likely to be ABI compatible too.

Attachments (3)

cjkv.patch (36.7 KB ) - added by Pavel Strashkin 15 years ago.
patch to add CJKV tokenizer supporting
cjk.patch (16.9 KB ) - added by Brandon Schaefer 14 years ago.
Replacing the old patch, uses the parser instead of lexer now.
cjk-ngram-applied-to-1.2-branch.patch (23.0 KB ) - added by Olly Betts 13 years ago.
Combined patch applied to 1.2 branch

Download all attachments as: .zip

Change History (36)

comment:1 by Richard Boulton, 18 years ago

Operating System: All
Status: newassigned

comment:3 by Olly Betts, 16 years ago

Description: modified (diff)
Milestone: 1.1.7
Type: defectenhancement

Fabrice Colin said on xapian-discuss:

Pinot uses a slightly modified version of Yung-Chung Lin's cjk-tokenizer that can be found at http://svn.berlios.de/wsvn/dijon/trunk/cjkv/CJKVTokenizer.cc For an example, see the XapianIndex and TokensIndexer classes at http://svn.berlios.de/wsvn/pinot/trunk/IndexSearch/Xapian/XapianIndex.cpp

comment:4 by Olly Betts, 16 years ago

Priority: normalhigh

This can probably be added without incompatible changes, but it would be good to have done.

comment:5 by Olly Betts, 15 years ago

Should be possible to do in an API and ABI compatible way, so bumping to stay on track.

comment:6 by Olly Betts, 15 years ago

Milestone: 1.1.71.2.0

Actually update the milestone...

comment:7 by Pavel Strashkin, 15 years ago

Cc: xaka2004@… added

Hi everyone! About month ago for company where i'm working was neccessary to add CJKV indexer to improve search mechanism. As backend we use Xapian and omega indexer.

I attached result of my work by integrating Dijon CJKVTokenizer into latest stable Xapian source tree (1.0.16). All tests passed, tokenizer works really great.

What i'm done:

  • added m4/pkg.m4 file to use pkg-config features to determine right CFLAGS and LIBS
  • with my patch Xapian depend on glib2 which uses in CJKV tokenizer to work with

unicode/utf-8

  • added checking for glib2 at configure time
  • expand LIBS and CFLAGS of xapian-config by glib2
  • added include/xapian/cjkv/CJKVTokenizer.h from Dijon (i leave Dijon namespace) with any touches
  • added queryparser/CJKVTokenizer.cc from Dijon without any touches
  • added modified QueryModifier which uses to modify input query (bigram model to split CJKV sequence to tokens, no changes for another parts of query). Its modifier uses at parser_query call time
  • added modified Indexer which uses in TermGenerator (bigram model to split CJKV sequence into terms)

To build Xapian you need:

  • call "aclocal" to regenerate aclocal.m4 and include added pkg.m4
  • call "autoconf"
  • call "automake"
  • be sure that you have install glib2
  • call "make"

I've modified 2 parts of Xapian: QueryParser::Internal::parse_query and TermGenerator::Interanl::index_text. As result you need just rebuild xapian-core and xapian-omega and i'll get CJKV.

comment:8 by Olly Betts, 15 years ago

Description: modified (diff)

Thanks for the patch - certainly a step forward.

There seem to be quite a lot of whitespace changes which make it harder to read. Can you regenerate it adding -bB to the diff options?

The new header shouldn't be under "include/xapian", since that's for the installed public API headers, but that's easy enough to fix.

It would be better to use Xapian's Unicode and UTF-8 support rather than adding a dependency on glib. Not just because adding avoidable dependencies is generally better, but also because there's scope for getting confused results if glib and Xapian's routines give different answers (as they might legitimately do if they are supporting different Unicode versions, or if invalid UTF-8 sequences are encountered).

I think it's probably better to have the user select "CJKV-mode". Exploding every string being indexed into a vector and then scanning it to see if CJKV characters are present is going to add a lot of overhead to everyone, even those indexing non-CJKV text. It also seems we don't want to completely change how we index (e.g.) English text which a Chinese name in. Alternatively, we could perhaps switch mode within a text string when we hit CJKV, and switch back when we hit non-CJKV.

by Pavel Strashkin, 15 years ago

Attachment: cjkv.patch added

patch to add CJKV tokenizer supporting

comment:9 by Pavel Strashkin, 15 years ago

Updated patch attached.

  1. Where i should put cjkv headers/sources files?
  1. Yes, glib2 dependency not good because Xapian already has Unicode/UTF-8 API. I agree, but i have no time while to completely rework cjkv code and because i've integrate Dijon's code "as is". One thing - Dijon/glib2 code will be used only if document has CJKV sequences, i.e. 99% backward compatible for non-CJKV documents :).
  1. How and where user should select CJKV-mode? What if user just have a big folder with many files which updates every day and every day this big folder is indexing. Or another example - international forums. There is no way to say "index this file/topic with CJKV-mode". We can try to optimize scanning and detecting CJKV sequence process.
  1. About your alternatively. Its already done in patch (if i'm right understand you). If indexable string doesn't have CJKV - will be used old algorithm.

Saying simple - "No CJKV - patch will not be used and all staying as is. If there CJKV - we will use modified queryparser/termgenerator code".

Lets continue discuss all things and i think i can help to complete integrate CJKV. Major work is done. Minor remains...

comment:10 by Olly Betts, 15 years ago

Sorry for not responding sooner, I'm insanely busy this month.

  1. Where i should put cjkv headers/sources files?

I'd suggest sticking the cjkv support in its own "cjkv" subdirectory, since it's essentially its own subsystem. Certainly "include" is only for headers visible to the end user, so that's not suitable.

  1. Yes, glib2 dependency not good because Xapian already has Unicode/UTF-8 API. I agree, but i have no time while to completely rework cjkv code and because i've integrate Dijon's code "as is". One thing - Dijon/glib2 code will be used only if document has CJKV sequences, i.e. 99% backward compatible for non-CJKV documents :).

I think this needs to be done before we can put this patch in a release, though I can probably sort it out when I'm less busy.

  1. How and where user should select CJKV-mode? What if user just have a big folder with many files which updates every day and every day this big folder is indexing. Or another example - international forums. There is no way to say "index this file/topic with CJKV-mode". We can try to optimize scanning and detecting CJKV sequence process.

In many cases the user knows they are handling particular languages, and then checking for CJKV is a waste of time. Conversely, you may only be handling CJKV, in which case checking is also pointless.

But in the "might be CJKV or might not" case, we certainly could be more efficient than converting the whole string to a vector and then scanning that. Xapian::Utf8Iterator would be a better approach.

  1. About your alternatively. Its already done in patch (if i'm right understand you). If indexable string doesn't have CJKV - will be used old algorithm.

I'm thinking of the case of a mixed document (a document without any CJKV characters is obviously easy to deal with, and similarly a document which is only CJKV is easy too).

I'm suggesting (perhaps) that if a document is in (say) English with quoted Chinese text, the English parts will be indexed as they currently are while the Chinese parts would be indexed with the CJKV rules, with the tokenizer switching between CJKV-mode and non-CJKV-mode as it goes. That avoids the need to decide whether such documents are "CJKV" or "non-CJKV", so there's no need to pre-scan them prior to actually indexing them.

Saying simple - "No CJKV - patch will not be used and all staying as is. If there CJKV - we will use modified queryparser/termgenerator code".

I think Xapian should have some sort of CJKV support, and this patch is a good start, but I do think it needs further work.

There's also the issue of the licence. Xapian is currently GPL, but we'd like to get to a position where we can relicense in the future. LGPL is a possible choice for the new licence, though we might want to go to a more liberal licence than that. I suspect this isn't a blocker, but we'd need to check with Fabrice.

comment:11 by Richard Boulton, 14 years ago

I did some work a while ago on tidying up cjk-tokenizer, based on the google code project, to use Xapian's unicode API. I'm not sure how helpful this is, but in case it's of use, I've just put it up on github, at https://github.com/rboulton/cjk-tokenizer.

comment:12 by Mikkel Kamstrup Erlandsen, 14 years ago

I've taken rboulton's branch from github and started to massageit into Xapian trunk. To add some some confusion to our VCS party you can find it in a bzr repo here https://code.launchpad.net/~kamstrup/xapian/cjk-support. As of writing it's still just baby steps - so nothing to see yet. When it is ready I'll post a patch here for those with bzr allergies :-)

I've extracted what I could from Olly's comments into the following battleplan:

  • Include the module in a new subdir xapian-code/queryparser/cjk/ (nothing in include/ until the API is acked)
  • Get rid of the excessive copying by using Utf8Iterator and on-the-fly conversions
  • Rely solely on Xapian's unicode handling (iow no new deps)
  • License wise - we talking X11/MIT licensed code. Maybe we can talk Richard and Yung-chung Lin into dual licensing under GPL as well as a stop gap measure?

comment:13 by Richard Boulton, 14 years ago

Sounds good - but there's no need to dual license, since X11/MIT is compatible with GPL - several files in Xapian are already licensed as X11/MIT. (I'd be happy to if it were necessary, but I'm pretty sure it isn't.)

comment:14 by Olly Betts, 14 years ago

Just to note, we have a GSoC project attempting to add indexing of Chinese text by segmentation this year.

The n-gram approach here is still going to be useful for at least Japanese and Korean though, and it might be useful as an alternative approach for Chinese.

comment:15 by Olly Betts, 14 years ago

Priority: highnormal

in reply to:  15 comment:16 by Brandon Schaefer, 14 years ago

I took over kamstrup branch and the main focus I gathered from him and your points were:

  • Avoiding any pre-scanning
  • Using the Utf8Iterators for the CJk tokenizing
  • Being able to switch back and forth
  • If all normal text no efficiency cost to having the CJK which the CJK checking is constant

Here my branch with trying to keep all those in mind: https://code.launchpad.net/~brandontschaefer/xapian/cjk-support-patch

Also a patch.

comment:17 by Olly Betts, 14 years ago

Thanks for your work on this - the latest patch is a big step forwards.

A couple of critical issues though:

  • This really needs test coverage to make sure it is actually working as intended, and continues to do so in the future.
  • The new code and changes to existing code don't match the coding style used in the Xapian codebase.

And some other thoughts, in no particular order:

  • queryparser/cjk/cjk-tokenizer.h isn't listed in queryparser/Makefile.mk, so won't get shipped. The cjk-LICENSE file won't get shipped either, but I think it would be better to insert that text at the start of each of the files it applies to, so it's completely clear what the licence of each file is.
  • I've not done any benchmarking, but I'm a little worried at the cost of CJK::codepoint_is_cjk() since it does a lot of comparisons and it looks like it gets called for every character. However, I notice that many of the ranges checked are actually contiguous and so could be merged into a smaller number of larger ranges - the compiler may do this for us, but it would be better not to rely on that. Also the last range is there twice! That reduces 23 ranges to 9 - if this still proves to be an issue, we can probably add an extra flag bit to the Unicode tables.
  • unicode_char_t - types ending _t are reserved by POSIX, so we shouldn't go creating new ones. Elsewhere we just use unsigned for this, so I'd suggest just doing the same.
  • ~tokenizer() is just the default empty destructor, so there's no benefit to explicitly defining it.
  • Generating the token stream TERM AND TERM ... AND TERM seems problematic - for example, there are higher precedence operators than AND, so there are probably cases which won't be parsed as you intend, and parse errors might end up confusingly referring to an AND when none appears in the query string. I'd suggest generating a CJKTERM token (or perhaps a TERM token with a "cjk" flag set) and then generating the n-grams when this token gets converted to a Query object.

in reply to:  17 comment:18 by Brandon Schaefer, 14 years ago

I took a look at the CJK::codepoint_is_cjk() and reduced it to 9 checks but also added this 'if (p < 0x2E80) return false'. So now if it runs into more common text it will just check once. It is still a constant so I don't see this as being an issue when it comes to speed, but I do see why this is very important if there is no CJK.

One question I have is about your last statement. It looks like where ever you generate the n-grams it is going to need an OP in-between them. If AND is not what you are looking for what else would you suggest putting between all the new terms generated? A couple idea's

  • Use the highest precedence OP or make a new CJK OP that also has the highest precedence for putting in-between after generating the n-grams. So any errors can report with the CJK OP.
  • Surround it with a brace ( BRA, KET ) but then again that is not in the query string, and use the default OP. This should provide what ever is generated to be stuck together even if a high precedence surrounds it.
  • Example (default OP_OR ): Xapian::Query((hello:(pos=1) AND (万:(pos=2) OR 万众:(pos=3) OR 众:(pos=4))))

The problem I see is what do you put in-between what is generated? I think the key point is whatever is generated from the n-gram can't be accidentally attached to something else with a higher precedence, which brackets should solve that. Also possibly having a cjk flag around to set for when an error might occur to refer to that instead of the OP put in-between.

I could also be overlooking something, but I am currently going the code to look for other options.

I'll also start adding some test cases to get a good test coverage to show it is working as intended and will continue to. Also will try to break it or produce un-intended results for the query parser.

Everything else you mentioned is changed already and will look out for any improper coding style to change.

Hope this can get fixed soon :)

comment:19 by Olly Betts, 14 years ago

The codepoint_is_cjk() overhead is constant, but that constant is per character processed, so "constant" is misleading - it's really O(n), where n is the size of text processed. And we often process a lot of characters. But indeed the most major concern right now is not to introduce a speed regression for everyone for this, and the early exit if < 0x2e80 pretty much guarantees that.

I already gave my suggested approach to the QueryParser tokenisation above, i.e.:

I'd suggest generating a CJKTERM token (or perhaps a TERM token with a "cjk" flag set) and then generating the n-grams when this token gets converted to a Query object.

In other words, don't generate the n-grams in the lexer, do it when the parser accepts this CJKTERM (or TERM with is_cjk set) and then just generate a Query object directly from the generated set on n-grams.

I think any approach which involves faking a token stream is potentially problematic.

A "CJK" operator might work, but just generating a single token is simpler. Why generate a special token stream in the lexer only to have to recognise it in the parser?

comment:20 by Brandon Schaefer, 14 years ago

Updated the patch to everything you wanted changed. The parser was a lot better of place to put it, I only added one rule to the CFG and it was: "compound_term(T) ::= CJKTERM(U)." then made a function under Term to produce the n-grams. I can also see where to add some more rules to get a better integration of CJK but for now this is a good start to see if this is the direction you want.

If there are other things you would like let me change let me know. I also added to the regression test for the term generator and query parser, though there are a few things that would still needed to be added to the grammar to allow for things such as NEAR to be used with CJK. Though this is a big step forward from no CJK allowed at all. It would also be helpful if you knew any one who was fluent in any of the CJK languages for better testing of semantics of the languages and more or less special cases and symbols. Then again the real solution to this is proper segmentation.

comment:21 by Olly Betts, 14 years ago

Milestone: 1.2.x1.3.0

Thanks - I'll give it a run through as soon as I can.

xaka: Are you fluent in any of these languages? If so, are you interested in taking a look at this?

Marking for 1.3.0.

comment:22 by Olly Betts, 14 years ago

A couple of very quick comments:

Clearly "include <cassert>" wasn't intended to be part of the licence text at the start of each file, but since it is the licence text I'd be happier to have an explicit OK for fixing that.

I think you must have tab display width set to 4, which probably explains most of the misidentation I noted before. The Xapian code assume tabs display as 8 spaces.

comment:23 by Brandon Schaefer, 14 years ago

Ok sorry about that, very new to the hacking world. Had vim set tab width to 4 for some reason, but now it's changed and pushed on the branch.

Also updated the patch with the -p option, forgot to add that in for the others.

Also removed the 'include <cassert>' in the license, must have slipped in there when merging the files.

by Brandon Schaefer, 14 years ago

Attachment: cjk.patch added

Replacing the old patch, uses the parser instead of lexer now.

comment:24 by Olly Betts, 13 years ago

I've committed the latest patch on a branch in git, cleaned up a few things, and fixed a bug with dereferencing an iterator before the end check:

https://github.com/ojwb/xapian-chinese-segmentation/tree/cjk-from-ticket-180

The plan is to merge in the Chinese segmentation support as well, since that needs to hook in in very similar places.

I noticed an issue with the term positions - currently the code blindly assigns a different position to every n-gram it generates, which doesn't seem a good approach. I'm not sure what the best approach is though. The key thing is we want phrases and the NEAR and ADJ operators to work in a natural way for users.

comment:25 by Olly Betts, 13 years ago

It looks to me like CJK::tokenizer::tokenize() will just ignore non-CJK characters, and for example produce a bi-gram from two CJK characters with arbitrary non-CJK between them.

But looking at where we call this method, it seems we only ever pass it all-CJK input.

So I propose we scrap the CJK test in there and just n-gram away on whatever the input is.

Thoughts?

comment:26 by Brandon Schaefer, 13 years ago

I've committed the latest patch on a branch in git, cleaned up a few things, and fixed a bug with dereferencing an iterator before the end check:

Thank you for cleaning it up and finding that bug.

I noticed an issue with the term positions - currently the code blindly assigns a different position to every n-gram it generates, which doesn't seem a good approach. I'm not sure what the best approach is though. The key thing is we want phrases and the NEAR and ADJ operators to work in a natural way for users.

As for the term positions that get assigned I was also unsure of the best option. I do agree that phrases, NEAR and ADJ operators should work with CJK chars and will be the next thing to integrate in. I think a better solution will come up when working on the phrases, NEAR and ADJ.

But looking at where we call this method, it seems we only ever pass it all-CJK input. So I propose we scrap the CJK test in there and just n-gram away on whatever the input is.

I saw that also, but left it in there 'just in case' but I agree fully that it is unnecessary since it only handles pure cjk.

comment:27 by Olly Betts, 13 years ago

OK, now we only add positional information for the single character CJK terms (which will save quite a bit of space).

And quoted phrases containing CJK now work - currently they just add each CJK character to the phrase as a separate term. We could also add the CJK bigrams as filters for the phrase, which should significantly cut down the number of cases we need to check positional data for, which will usually be faster, but that's just an efficiency tweak so I've left that for now. These changes are pushed to the git branch.

comment:28 by Olly Betts, 13 years ago

Dai Youli noted on IRC that mixed numbers like 2千3百 (two thousand three hundred) get indexed as four separate terms - while that's not terrible (since the same does at least happen at search time), it's not ideal either - searching for 2千3百 would find 3千2百, as well as documents containing those characters nowhere near each other.

Perhaps digits among CJK characters should be included in the span of text to be passed for n-gramming though.

comment:29 by Olly Betts, 13 years ago

I've applied the work from the git branch to SVN trunk (r16052 and r16053 to fix a memory leak I missed) and backported to 1.2 (r16054 and r16055). Currently this new code is only enabled if environment variable XAPIAN_CJK_NGRAM is set to a non-empty value. Ubuntu were keen to get this support for their upcoming release, and this way we can provide a compatible 1.2.x package for other applications, yet easily allow the applications which want it to have the CJK support. We'll want to sort out an API for enabling this (and any additional CJK modes which are added in the future) of course.

The substantial changes from the patch here are:

  • Fixed dereferencing of an end Utf8Iterator (probably harmless in practice)
  • Remove unnecessary extra check for a character being CJK
  • Only assign term positions to 1-grams (gives natural positions and reduces database size)
  • Quoted CJK phrases work
  • No intermediate vector of n-grams is built - saves space and is faster (~11% saving in total indexing time in a quick test)

by Olly Betts, 13 years ago

Combined patch applied to 1.2 branch

comment:30 by Olly Betts, 13 years ago

Milestone: 1.3.01.3.x

comment:31 by Olly Betts, 9 years ago

Milestone: 1.3.x1.3.4

So reviewing this ticket, it seems that the main thing that's actually left to do is adding a proper API for enabling this. I think a simple new flag for each of QueryParser and TermGenerator will do the trick nicely, and is entirely feasible to do for 1.3.4.

There's also the mixed digits and CJK thing mentioned in comment:28 - not sure if that's easy to adjust or not.

comment:32 by Olly Betts, 9 years ago

Milestone: 1.3.41.2.22

FLAG_CJK_NGRAM added in [e55890e7c485dd87d87c3b2f3c00ac7db4b91647].

And I've spun off the mixed numbers issue as #699.

The new flag seems worth backporting for 1.2.22, so setting milestone so we don't forget.

comment:33 by Olly Betts, 9 years ago

Follow-on commits to backport:

comment:34 by Olly Betts, 9 years ago

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.