Ticket #150 (assigned enhancement)

Opened 19 months ago

Last modified 7 months ago

Enhancements to Unicode support

Reported by: olly
Owned by: olly
Priority: normal
Milestone: 1.1.0
Component: QueryParser
Version: SVN trunk
Severity: minor
Keywords:
Cc:
Blocked By:
Operating System: All
Blocking:

Description (last modified by richard)

This bug just gathers together the enhancements we'd like to make to our Unicode support.

Currently I'm aware of two:

* Special cases for case conversion: http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Case_Mappings and in particular: http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing.txt

* Normalisation (mostly combining accents): http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Decompositions_and_Normalization
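To illustrate what these two items mean in practice, here is a small sketch using Python's standard library rather than our own code. The helper name normalise_term is hypothetical, and combining NFC with case folding is just one possible choice, not a decided design.

```python
import unicodedata

def normalise_term(text: str) -> str:
    """Hypothetical helper: compose combining accents (NFC), then case-fold.

    Case folding applies the special-case mappings, so the result can be
    longer than the input (U+00DF SHARP S folds to "ss").
    """
    return unicodedata.normalize("NFC", text).casefold()

# Special casing: a simple one-to-one mapping cannot express this,
# because one character uppercases to two.
sharp_s_upper = "\u00df".upper()

# Normalisation: "e" + COMBINING ACUTE ACCENT and the precomposed
# U+00E9 are different code point sequences for the same text.
decomposed = "Cafe\u0301"
precomposed = "CAF\u00c9"
same_term = normalise_term(decomposed) == normalise_term(precomposed)
```

Without normalisation the two spellings of "café" would be indexed as different terms, which is the kind of difference that affects database compatibility.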

I'd imagine we would want to target most such changes at 1.1.0, for reasons of database compatibility. There are probably cases where it would be reasonable to implement such changes sooner though: for example, if a change only produces a different database in a case where the existing behaviour is poor, or if the difference isn't problematic for some other reason.

Change History

Changed 19 months ago by olly

  • status changed from new to assigned

Another is word-splitting. Currently we split rather simply, by considering certain characters to be "term characters" and allowing certain suffixes and "infixes". Unicode defines rules for identifying words, which we should probably use, though probably with a few tweaks: for example, we want "C++", "C#" and "AT&T" to be terms, and the Unicode rules don't seem to count them as words:

http://www.unicode.org/reports/tr29/
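As a rough illustration of the "term characters plus suffixes and infixes" approach described above, here is a minimal sketch. The exact character sets are assumptions chosen to make the examples work ("+" and "#" as suffixes, "&" and "'" as infixes); they are not taken from the QueryParser source, and this does not implement the TR29 rules.

```python
import re

# Assumption: a "term character" is any Unicode letter or digit.
TERM_CHARS = r"[^\W_]+"

# Assumption: "&" and "'" may join two runs of term characters (infixes),
# and any number of "+" or "#" may follow a term (suffixes).
TERM_RE = re.compile(rf"{TERM_CHARS}(?:[&']{TERM_CHARS})*[+#]*")

def split_terms(text: str) -> list[str]:
    """Return the terms found in text, in order of appearance."""
    return TERM_RE.findall(text)

terms = split_terms("I use C++ and C# at AT&T.")
```

A splitter based purely on the Unicode word-boundary rules would need equivalent tweaks to keep "C++", "C#" and "AT&T" intact.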

Changed 18 months ago by olly

  • blocking set to 160

This is mostly (if not entirely) 1.1.0 material, so I've set it to block bug#160.

Changed 18 months ago by trac

  • platform set to All

Changed 18 months ago by olly

Two items from Utf8Support on the wiki:

Perhaps scriptindex should support converting text from other encodings to UTF-8? This could be implemented in a backward compatible way in 1.0.x.
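The conversion itself is simple once the source encoding is known. A minimal sketch in Python, assuming the encoding name is supplied by the user (the function name and parameter are hypothetical, not an existing scriptindex option):

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from source_encoding as UTF-8.

    Raises UnicodeDecodeError if data is not valid in source_encoding,
    and LookupError if the encoding name is unknown.
    """
    return data.decode(source_encoding).encode("utf-8")

latin1_input = b"caf\xe9"  # "café" in ISO-8859-1
utf8_output = to_utf8(latin1_input, "iso-8859-1")
```

Because the conversion happens before indexing, text that is already UTF-8 is unaffected, which is why this could be backward compatible.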

omindex assumes text files are UTF-8, although the UTF-8 parser (which is used for both term and sample generation) falls back to ISO-8859-1 for invalid UTF-8 sequences. We could instead use "libmagic" to do "charset detection" (see also bug#114).
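The fallback behaviour can be sketched in Python with a custom codec error handler. This is an illustration only: the handler name is made up, and treating each invalid byte run individually as ISO-8859-1 is an assumption about the granularity of the fallback.

```python
import codecs

def _latin1_fallback(err):
    """Decode the bytes that failed UTF-8 decoding as ISO-8859-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_bytes = err.object[err.start:err.end]
    # Return the replacement text and the position to resume decoding at.
    return bad_bytes.decode("iso-8859-1"), err.end

# "latin1-fallback" is a hypothetical handler name registered for this sketch.
codecs.register_error("latin1-fallback", _latin1_fallback)

def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 for invalid sequences."""
    return data.decode("utf-8", errors="latin1-fallback")

valid_utf8 = decode_lenient(b"caf\xc3\xa9")
latin1_only = decode_lenient(b"caf\xe9")
mixed = decode_lenient(b"\xe9t\xc3\xa9")
```

This never fails, but it silently guesses ISO-8859-1 for anything that isn't valid UTF-8, so a file in some other encoding yields wrong characters. That is the case charset detection would address.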

Changed 17 months ago by olly

Provide a way to specify the output encoding for Omega.
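One question this raises is what to do with characters the chosen encoding can't represent. A minimal sketch in Python, assuming the output is HTML so that numeric character references are acceptable (the function name is hypothetical and nothing here is an existing Omega setting):

```python
def encode_output(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text for output in the requested encoding.

    Characters that the encoding cannot represent are written as HTML
    numeric character references instead of raising an error.
    """
    return text.encode(encoding, errors="xmlcharrefreplace")

as_utf8 = encode_output("caf\u00e9")
as_latin1 = encode_output("caf\u00e9", "iso-8859-1")
as_ascii = encode_output("caf\u00e9 \u20ac", "ascii")
```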

Changed 7 months ago by richard

  • description modified (diff)
  • milestone set to 1.1

Changed 7 months ago by richard

  • blocking deleted