Opened 21 years ago

Closed 18 years ago

Last modified 18 years ago

#30 closed defect (released)

indexer, query parser and stemmer should handle UTF-8 data

Reported by: Robert Pollak
Owned by: Olly Betts
Priority: high
Milestone:
Component: Omega
Version: 0.8.0
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Olly said in <http://article.gmane.org/gmane.comp.search.xapian.general/613> ("Support for Asian languages or Unicode?") that the stemmers assume ISO-8859-1 (except for Russian, which assumes KOI8-R). According to <http://www.snowball.tartarus.org/archives/snowball-discuss/0616.html> (dated May 10 2004), there are already UCS2-based Unicode stemmers, but it seems snowball does not yet support UTF-8.

Change History (7)

comment:1 by Olly Betts, 20 years ago

rep_platform: PC → All
Status: new → assigned

Probably the best approach is to convert everything to a standard encoding. UTF-8 is a good candidate, at least for the languages we support stemming in - it will bloat something like Chinese rather, but compressing the record table tags (as we plan to) would counteract that.

But as you say, snowball doesn't yet support UTF-8. So either we are blocked on snowball supporting UTF-8, or we convert everything into what snowball does support, stem, and then convert to UTF-8.
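The second option could look roughly like the following Python sketch. It is only an illustration: `legacy_stem` is a hypothetical stand-in for a byte-oriented stemmer that expects ISO-8859-1 input (it is not snowball's API), and its toy suffix rule is an assumption made purely to keep the example runnable.

```python
def legacy_stem(word: bytes) -> bytes:
    """Hypothetical stand-in for a stemmer that only understands ISO-8859-1.

    Strips a trailing "en" from words longer than four bytes; real stemming
    rules are far more involved.
    """
    if len(word) > 4 and word.endswith(b"en"):
        return word[:-2]
    return word


def stem_utf8(term: bytes) -> bytes:
    """Stem a UTF-8 encoded term via an ISO-8859-1 round trip."""
    text = term.decode("utf-8")
    try:
        latin1 = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Not representable in the stemmer's encoding: leave it unstemmed.
        return term
    stemmed = legacy_stem(latin1)
    # Convert the stemmer's output back to the standard encoding.
    return stemmed.decode("iso-8859-1").encode("utf-8")
```

With this sketch, `stem_utf8("Mädchen".encode("utf-8"))` goes through the stand-in stemmer and comes back as UTF-8, while a term such as `"中文"` cannot be represented in ISO-8859-1 and is returned unchanged.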

Is your current problem with text in UTF-8 which could be represented in ISO-8859-1 and is in a language which we can stem?

comment:2 by Robert Pollak, 20 years ago

Most of the UTF-8 documents and queries that I have to process are in German. Currently, I am converting the documents to ISO-8859-1 before stemming/indexing, and I am converting the query strings to ISO-8859-1 before passing them to the QueryParser. I am simply dropping the characters that can't be converted.
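A workaround of this kind might be sketched as follows in Python. This is not the code actually used here; the function name `utf8_to_latin1_lossy` is made up for illustration, and the same function would be applied to both document text and query strings before handing them on.

```python
def utf8_to_latin1_lossy(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1.

    Characters with no ISO-8859-1 equivalent are silently dropped
    (errors="ignore"), so the conversion is lossy.
    """
    return data.decode("utf-8").encode("iso-8859-1", errors="ignore")


# German umlauts and sharp s survive; the euro sign is not in ISO-8859-1
# and is dropped.
converted = utf8_to_latin1_lossy("Größe €5".encode("utf-8"))
print(converted)  # b'Gr\xf6\xdfe 5'
```

The silent dropping is what makes this approach unsuitable for text outside Western European languages: a term made up entirely of unrepresentable characters becomes empty.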

comment:3 by Olly Betts, 20 years ago

Relevant discussion on the snowball list:

http://thread.gmane.org/gmane.comp.search.snowball/668

comment:4 by Olly Betts, 19 years ago

Snowball's stemmers now support utf-8, but we need to update to a newer version of snowball to get this.

I have a hacked QueryParser which handles utf-8. It introduces a dependency on glib, which is probably OTT just for utf-8 handling, so that probably needs fixing...

comment:5 by Olly Betts, 18 years ago

Work is now well under way - for current status, see:

http://wiki.xapian.org/Utf8Support

comment:6 by Olly Betts, 18 years ago

Resolution: fixed
Status: assigned → closed

Now essentially fixed in SVN HEAD (the only real utf-8 work remaining is on the bindings, which aren't mentioned in this bug).

comment:7 by Olly Betts, 18 years ago

Operating System: All
Resolution: fixed → released

Fixed in 1.0.0 release.
