#30 closed defect (released)
indexer, query parser and stemmer should handle UTF-8 data
Reported by: | Robert Pollak | Owned by: | Olly Betts |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Omega | Version: | 0.8.0 |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
Olly said in <http://article.gmane.org/gmane.comp.search.xapian.general/613> ("Support for Asian languages or Unicode?") that the stemmers assume ISO-8859-1 (except for Russian, which assumes KOI8-R). According to <http://www.snowball.tartarus.org/archives/snowball-discuss/0616.html> (dated May 10 2004), there are already UCS-2-based Unicode stemmers, but it seems Snowball does not yet support UTF-8.
Change History (7)
comment:1 by , 20 years ago
rep_platform: | PC → All |
---|---|
Status: | new → assigned |
comment:2 by , 20 years ago
Most of the UTF-8 documents and queries that I have to process are in German. Currently, I convert the documents to ISO-8859-1 before stemming/indexing, and I convert the query strings to ISO-8859-1 before passing them to the QueryParser. I simply drop any characters that can't be converted.
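For illustration, the workaround described above might look roughly like the following in Python. This is only a sketch: the function name `utf8_to_latin1_lossy` is made up for this example and is not part of Xapian or Omega, and the reporter's actual conversion code is not shown in this ticket.

```python
def utf8_to_latin1_lossy(data: bytes) -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, dropping unconvertible characters.

    Hypothetical helper for illustration only; not a Xapian API.
    """
    # Decode the incoming UTF-8 bytes to a Unicode string first.
    text = data.decode("utf-8")
    # errors="ignore" silently drops every character that has no
    # ISO-8859-1 representation (for example the euro sign, U+20AC).
    return text.encode("iso-8859-1", errors="ignore")


if __name__ == "__main__":
    sample = "Größe €5".encode("utf-8")
    # The umlaut and sharp s survive; the euro sign is dropped.
    print(utf8_to_latin1_lossy(sample))
```

The same function would have to be applied to both documents and query strings, since a term that loses a character at index time only matches a query that loses the same character.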
comment:4 by , 19 years ago
Snowball's stemmers now support UTF-8, but we need to update to a newer version of Snowball to get this.
I have a hacked QueryParser which handles UTF-8. It introduces a dependency on glib, which is probably over the top just for UTF-8 handling, so that probably needs fixing...
comment:6 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Now essentially fixed in SVN HEAD (the only real UTF-8 work remaining is on the bindings, which aren't mentioned in this bug).
comment:7 by , 18 years ago
Operating System: | → All |
---|---|
Resolution: | fixed → released |
Fixed in 1.0.0 release.
Probably the best approach is to convert everything to a standard encoding. UTF-8 is a good candidate, at least for the languages we support stemming in
table tags (as we plan to) would counteract that.
But as you say, Snowball doesn't yet support UTF-8. So either we are blocked on Snowball supporting UTF-8, or we convert everything into an encoding Snowball does support, stem, and then convert the result to UTF-8.
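The second option (convert, stem, convert back) could be sketched as below. Everything here is an assumption made for illustration: `stem_utf8_via_latin1` is a hypothetical wrapper, and `toy_stemmer` is a trivial stand-in that strips a trailing "en", not the Snowball algorithm or any Xapian API.

```python
from typing import Callable


def stem_utf8_via_latin1(word: bytes,
                         stem_latin1: Callable[[bytes], bytes]) -> bytes:
    """Stem a UTF-8 word using a stemmer that only understands ISO-8859-1.

    Hypothetical wrapper for illustration; raises UnicodeEncodeError if
    the word cannot be represented in ISO-8859-1.
    """
    # Step 1: convert into the encoding the stemmer supports.
    latin1 = word.decode("utf-8").encode("iso-8859-1")
    # Step 2: stem in that encoding.
    stemmed = stem_latin1(latin1)
    # Step 3: convert the stem to UTF-8 for storage in the index.
    return stemmed.decode("iso-8859-1").encode("utf-8")


def toy_stemmer(word: bytes) -> bytes:
    """Stand-in stemmer (NOT Snowball): strips a trailing 'en'."""
    return word[:-2] if word.endswith(b"en") else word


if __name__ == "__main__":
    result = stem_utf8_via_latin1("grüßen".encode("utf-8"), toy_stemmer)
    print(result.decode("utf-8"))
```

A word outside ISO-8859-1 raises an error in this sketch rather than being silently altered, so the caller has to decide whether to skip stemming for it or drop the offending characters.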
Is your current problem with text in UTF-8 which could be represented in ISO-8859-1 and is in a language which we can stem?