Opened 14 years ago
Last modified 21 months ago
#514 new enhancement
Omega language detection with textcat
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Omega | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
This is a rough patch (which is unlikely to apply cleanly) split out from the monster patch in #282.
Work is still needed before we can consider applying this. In particular, using a different stemmer for each document like this is problematic (perhaps prefixing terms with the language would help there?).
Marking as 1.2.x as this could probably be done in a compatible way.
Attachments (2)
Change History (8)
by , 14 years ago
Attachment: | xapian-omega-language-detection-with-textcat.patch added |
---|
comment:1 by , 12 years ago
Milestone: | 1.2.x → 1.3.x |
---|
by , 10 years ago
Attachment: | xapian-omega-language-detection-with-textcat-updated.patch added |
---|
updated patch - compiles, but untested
comment:2 by , 10 years ago
I've updated the patch. What's really missing is a plan for handling multiple stemming languages sanely.
comment:4 by , 5 years ago
Milestone: | 1.4.x → 1.5.0 |
---|---|
Version: | SVN trunk → git master |
Seems the active libtextcat fork is libexttextcat (https://wiki.documentfoundation.org/Libexttextcat) - this one is packaged for Debian at least.
The patch needs updating to current git master and to use this (it looks like the API is the same, or not very different).
I think it would help if Xapian::Stem's constructor could be told to treat unknown language codes as "none" rather than throwing an exception, since then we could just set a stemmer based on the detected language.
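To make the intended fallback concrete, here is a minimal sketch using only the standard library. The helper name stemmer_for is hypothetical and not part of Xapian; the list of available stemmers is passed in as a space-separated string, which is the form Xapian::Stem::get_available_languages() returns.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Xapian): choose the stemmer name to use
// for a detected language.  `available` is a space-separated list of
// supported stemmer names.
std::string stemmer_for(const std::string& detected,
                        const std::string& available) {
    std::istringstream names(available);
    std::string name;
    while (names >> name) {
        if (name == detected) return detected;
    }
    // Unknown or empty language: fall back to no stemming instead of failing.
    return "none";
}

// Example: an undetected or unsupported language degrades to "none".
void stemmer_for_example() {
    const std::string available = "english french german";
    assert(stemmer_for("german", available) == "german");
    assert(stemmer_for("klingon", available) == "none");
}
```

With a fallback like this the indexer never has to special-case a failed or unsupported detection; it simply indexes that document unstemmed.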
We also still need a plan for handling multiple stemming languages.
If we add the stemmed terms as Zfoo for each language then we can search unstemmed across the whole dataset, but a stemmed search needs to be filtered by the respective L-prefix term. But this causes a stats contamination problem between terms in different languages unless we encode the language into the term prefix.
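As a rough illustration of encoding the language into the term prefix, the sketch below keeps per-term counts in a plain std::map as a stand-in for database statistics. The "Z" + language + ":" layout and the names stemmed_term, index_stem and TermFreqs are assumptions made for this example, not an agreed scheme.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical scheme: "Z" marks a stemmed term (as in Zfoo); the language
// code and the ':' separator are assumptions made for this sketch.
std::string stemmed_term(const std::string& lang, const std::string& stem) {
    assert(!lang.empty() && lang.find(':') == std::string::npos);
    return "Z" + lang + ":" + stem;
}

// Toy stand-in for per-term statistics: term -> number of occurrences.
using TermFreqs = std::map<std::string, unsigned>;

// Record one occurrence of `stem` in a document detected as `lang`.
void index_stem(TermFreqs& freqs, const std::string& lang,
                const std::string& stem) {
    ++freqs[stemmed_term(lang, stem)];
}
```

Because the language is part of the term, an English stem "die" and a German stem "die" are counted separately, so their frequencies no longer contaminate each other.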
Alternatively, we could have a separate database for each language. This seems more satisfactory, but care is needed to handle updated documents for which the detected language changes, and the consequences need working through.
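One way to picture the changed-language case is the bookkeeping sketch below. It assumes a per-language database layout; Route and LanguageRouter are hypothetical names, and a real indexer would need to persist the mapping rather than hold it in memory.

```cpp
#include <cassert>
#include <map>
#include <string>

// Result of routing one document: where to index it, and which database
// (if any) holds a stale copy that must be deleted.
struct Route {
    std::string index_into;
    std::string delete_from;  // empty when there is no stale copy
};

// Hypothetical bookkeeping: remembers the language each document was last
// indexed under.
class LanguageRouter {
    std::map<std::string, std::string> last_lang_;  // doc id -> language
  public:
    Route route(const std::string& doc_id, const std::string& lang) {
        assert(!lang.empty());
        Route r{lang, ""};
        auto it = last_lang_.find(doc_id);
        if (it != last_lang_.end() && it->second != lang) {
            // Detected language changed: the old database has a stale copy.
            r.delete_from = it->second;
        }
        last_lang_[doc_id] = lang;
        return r;
    }
};
```

Reindexing a document under the same language yields an empty delete_from, while a change from "en" to "de" reports "en" as the database to clean up. Without that step the document would be findable in both databases.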
I think this is too invasive for 1.4.x, so marking for 1.5.0.
comment:5 by , 5 years ago
I think it would help if Xapian::Stem's constructor could be told to treat unknown language codes as "none" rather than throwing an exception, since then we could just set a stemmer based on the detected language.
Implemented this in c71fff5b1a55279637d10995c844b24170f7eccc.
comment:6 by , 21 months ago
Milestone: | 1.5.0 → 2.0.0 |
---|
Rough patch for adding textcat support