Opened 14 years ago

Last modified 20 months ago

#514 new enhancement

Omega language detection with textcat

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: Omega
Version: git master
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

This is a rough patch (which is unlikely to apply cleanly) split out from the monster patch in #282.

Work is still needed before we can consider applying this. In particular, using a different stemmer for each document like this is problematic (perhaps prefixing terms with the language would help there?)

Marking as 1.2.x as this could probably be done in a compatible way.

Attachments (2)

xapian-omega-language-detection-with-textcat.patch (67.6 KB) - added by Olly Betts 14 years ago.
Rough patch for adding textcat support
xapian-omega-language-detection-with-textcat-updated.patch (67.3 KB) - added by Olly Betts 9 years ago.
updated patch - compiles, but untested


Change History (8)

by Olly Betts, 14 years ago

Rough patch for adding textcat support

comment:1 by Olly Betts, 12 years ago

Milestone: 1.2.x → 1.3.x

by Olly Betts, 9 years ago

updated patch - compiles, but untested

comment:2 by Olly Betts, 9 years ago

I've updated the patch. What's really missing is a plan for handling multiple stemming languages sanely.

comment:3 by Olly Betts, 9 years ago

Milestone: 1.3.x → 1.4.x

Not a blocker for 1.4.0.

comment:4 by Olly Betts, 5 years ago

Milestone: 1.4.x → 1.5.0
Version: SVN trunk → git master

Seems the active libtextcat fork is libexttextcat (https://wiki.documentfoundation.org/Libexttextcat) - this one is packaged for Debian at least.

The patch needs updating to current git master and to use this (it looks like the API is the same, or not very different).

I think it would help if Xapian::Stem's constructor could be told to treat unknown language codes as "none" rather than throwing an exception, since then we could just set a stemmer based on the detected language.
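To make the proposed behaviour concrete, here is a minimal sketch in plain Python. It does not use Xapian at all: `make_stemmer`, `KNOWN_LANGUAGES` and `UnknownLanguageError` are hypothetical names invented for illustration, and the language list is an arbitrary subset.

```python
# Hypothetical stand-in for a stemmer factory; NOT Xapian's real API.
KNOWN_LANGUAGES = {"en", "fr", "de"}  # assumption: illustrative subset only


class UnknownLanguageError(ValueError):
    """Raised when a language code has no stemmer and no fallback is allowed."""


def make_stemmer(language, fallback=False):
    """Return the stemmer name to use for a detected language code.

    With fallback=True an unknown code maps to "none" (no stemming)
    instead of raising, so the caller can pass the detector's output
    straight through.
    """
    if language in KNOWN_LANGUAGES:
        return language
    if fallback:
        return "none"
    raise UnknownLanguageError(language)
```

With the fallback enabled, the indexer would not need its own table of which detected languages have stemmers.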

We also still need a plan for handling multiple stemming languages.

If we add the stemmed terms as Zfoo for each language then we can search unstemmed across the whole dataset, but a stemmed search needs to be filtered by the respective L-prefix term. But this causes a stats contamination problem between terms in different languages unless we encode the language into the term prefix.
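A toy illustration of the contamination problem, in plain Python with no Xapian calls. The corpus is made up, and the `"Z" + lang + ":"` prefix format is only an assumed encoding for the sake of the example, not an existing Omega convention.

```python
from collections import Counter

# Toy corpus: (detected language, already-stemmed words). Illustrative data only.
DOCS = [
    ("en", ["chat", "room"]),
    ("fr", ["chat", "noir"]),
    ("fr", ["chat", "gris"]),
]


def shared_prefix_terms(lang, stems):
    # Z-prefixed stems shared by all languages, plus an L-prefixed language term.
    return ["Z" + s for s in stems] + ["L" + lang]


def per_language_terms(lang, stems):
    # Assumed encoding of the language into the stem prefix (hypothetical format).
    return ["Z" + lang + ":" + s for s in stems] + ["L" + lang]


def doc_freq(scheme):
    """Count how many documents each term appears in under the given scheme."""
    df = Counter()
    for lang, stems in DOCS:
        df.update(set(scheme(lang, stems)))
    return df
```

Under the shared scheme, the English and French stem "chat" both land on `Zchat`, so its document frequency mixes the two languages even if the query is filtered by `Len` or `Lfr`. With the language in the prefix, the counts stay separate.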

But we could have a separate database for each language - this seems more satisfactory, but care is needed to handle updated documents for which the detected language changes, and the consequences need working through.
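The update case could look roughly like the following sketch, which models each per-language database as a plain Python dict keyed by document id. `index_document` is a hypothetical helper; a real implementation would work on Xapian databases and would have more consequences to deal with than this shows.

```python
def index_document(databases, doc_id, lang, text):
    """Store a document in the database for its detected language.

    `databases` maps language code -> {doc_id: text}. Any copy of the
    document left in another language's database (because the detected
    language changed on reindexing) is removed first.
    """
    for other_lang, db in databases.items():
        if other_lang != lang:
            db.pop(doc_id, None)  # drop the stale copy, if there is one
    databases.setdefault(lang, {})[doc_id] = text
```

For example, indexing `doc1` as `en` and then reindexing it as `fr` leaves it only in the `fr` database. The cost is that every update has to consider all the other databases, not just the target one.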

I think this is too invasive for 1.4.x, so marking for 1.5.0.

comment:5 by Olly Betts, 5 years ago

> I think it would help if Xapian::Stem's constructor could be told to treat unknown language codes as "none" rather than throwing an exception, since then we could just set a stemmer based on the detected language.

Implemented this in c71fff5b1a55279637d10995c844b24170f7eccc.

comment:6 by Olly Betts, 20 months ago

Milestone: 1.5.0 → 2.0.0