Opened 17 years ago
Last modified 10 months ago
#150 assigned enhancement
Enhancements to Unicode support
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | QueryParser | Version: | git master |
Severity: | minor | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description (last modified by )
This bug is intended to just gather together enhancements we'd like to make to our Unicode support.
Currently I'm aware of:
- Special cases for case conversion: http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Case_Mappings and in particular: http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing.txt
- Normalisation (mostly combining accents): http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Decompositions_and_Normalization
- Unicode has rules for identifying word boundaries, which we should investigate and perhaps use more of. For example, we currently handle a space followed by a non-spacing mark wrongly.
I'd imagine we would probably want to target most such changes at a ".0" release, for reasons of database compatibility. There are probably cases where it would be reasonable to implement such changes sooner though - if we build a different database in a case where the existing behaviour is poor, or the difference isn't problematic for some other reason, say.
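To make the three items listed above concrete, here is a minimal sketch using Python's standard `unicodedata` module. It only illustrates the Unicode behaviour in question and is not Xapian code; the helper name `is_nonspacing_mark` is made up for this example.

```python
import unicodedata

# Special casing: one code point can map to several when case-converted,
# so a simple one-to-one case table is not enough.
sharp_s_upper = "\u00df".upper()   # U+00DF LATIN SMALL LETTER SHARP S
fi_upper = "\ufb01".upper()        # U+FB01 LATIN SMALL LIGATURE FI

# Normalisation: the same accented letter can be stored in two ways.
composed = "\u00e9"                # e with acute as a single code point
decomposed = "e\u0301"             # "e" followed by U+0301 COMBINING ACUTE ACCENT
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed


def is_nonspacing_mark(ch: str) -> bool:
    """Return True if ch is in general category Mn (Mark, nonspacing)."""
    return unicodedata.category(ch) == "Mn"


# Word boundaries: a space followed by a non-spacing mark, the sequence
# the third item says is currently handled wrongly.
space_then_mark = " \u0301"
```

Without normalisation the two spellings of "é" would index as different terms, which is the motivation for the second item.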
Change History (12)
comment:1 by , 17 years ago
Status: | new → assigned |
---|
comment:2 by , 17 years ago
Blocking: | 160 added |
---|---|
Operating System: | → All |
This is mostly (if not entirely) 1.1.0 material, so set to block bug#160.
comment:3 by , 17 years ago
Two items from Utf8Support on the wiki:
Perhaps scriptindex should support converting text from other encodings to UTF-8? This could be implemented in a backward compatible way in 1.0.x.
omindex assumes text files are UTF-8 (although the UTF-8 parsing falls back to ISO-8859-1 for invalid UTF-8 sequences and is used for both term and sample generation). But we could use "libmagic" to do "charset detection" (see also bug#114).
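The fallback described above (decode as UTF-8, but treat bytes that are not valid UTF-8 as ISO-8859-1) can be sketched in Python as follows. This is an illustration of the idea only, not omindex's actual C++ implementation; the error-handler name `latin1-fallback` and the function `decode_lenient` are invented for the example.

```python
import codecs


def _latin1_fallback(err):
    """Decode the offending bytes as ISO-8859-1 and resume after them."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("iso-8859-1"), err.end


# Register the handler under a name usable with bytes.decode(errors=...).
codecs.register_error("latin1-fallback", _latin1_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, mapping invalid byte sequences via ISO-8859-1."""
    return data.decode("utf-8", errors="latin1-fallback")
```

For example, `decode_lenient(b"caf\xc3\xa9")` and `decode_lenient(b"caf\xe9")` both yield "café", so a file in either encoding produces the same terms.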
comment:6 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.1 |
comment:7 by , 17 years ago
Blocking: | 160 removed |
---|
comment:8 by , 16 years ago
Milestone: | 1.1.0 → 2.0.0 |
---|
Pushing back to milestone:2.0.0 though that might mean 1.3.0 development for a 1.4.0 release - really I'm just saying "not for 1.1.x or 1.2.x".
Update to comment:4 - omindex now checks for a BOM in text files.
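As a rough illustration of what a BOM check involves, here is a Python sketch. Which BOMs omindex actually recognises is not stated in this ticket, so the list below is an assumption, and `sniff_bom` is a hypothetical helper name.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding or None, data with any recognised BOM removed)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

When no BOM is found the caller would carry on with the existing UTF-8 assumption.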
comment:9 by , 16 years ago
Description: | modified (diff) |
---|
comment:10 by , 16 years ago
Description: | modified (diff) |
---|
comment:11 by , 13 years ago
FYI, I'm using Xapian, and I 'flatten' (normalize) strings before adding them as terms; my table-based implementation:
http://gitorious.org/mu/mu-ng/blobs/master/src/mu-str-normalize.c
It's sufficient for most Latin-based accented characters, and its strong point (for speed and memory usage) is that it can flatten the strings _in place_.
For a more complete (and shorter) version, some equivalent of g_str_normalize could be used, where first the accents and base characters are separated, and after that the accent chars are removed.
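The second approach (decompose, then drop the accents) can be sketched in a few lines of Python. This is not the mu code linked above, which is table-based C; `flatten` is a name chosen for the example, and it allocates a new string rather than working in place.

```python
import unicodedata


def flatten(text: str) -> str:
    """Strip accents: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no canonical decomposition, such as "ø", are left unchanged by this approach, which is one place where a hand-built table can do more.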
comment:12 by , 17 months ago
omindex assumes text files are UTF-8 (although the UTF-8 parsing falls back to ISO-8859-1 for invalid UTF-8 sequences and is used for both term and sample generation). But we could use "libmagic" to do "charset detection"
I had a quick look at doing so, but basically libmagic isn't actually useful for what we want - it seems to either say binary, us-ascii, iso-8859-1, utf-8 or unknown-8bit (for some files in cp-1252, the Microsoft embrace-and-extend superset of iso8859-1). The binary files aren't text files, and the rest omindex should already handle correctly because it falls back to treating invalid UTF-8 text as cp-1252.
To be useful here we need something which can actually detect non-Unicode encodings, and ideally also which iso8859-N is in use.
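A small Python sketch shows why this detection is hard: the same byte is valid text in every single-byte encoding, and only the intended meaning differs. `decode_candidates` is a hypothetical helper written for this illustration.

```python
def decode_candidates(data: bytes, encodings):
    """Decode data under each candidate encoding; all of them may succeed."""
    return {enc: data.decode(enc) for enc in encodings}


# Byte 0xE9 is a letter in each of these encodings, just a different one.
letters = decode_candidates(
    b"\xe9", ["iso-8859-1", "iso-8859-5", "iso-8859-7", "cp1252"]
)

# Byte 0x93 is a curly quote in cp-1252 but a C1 control in iso-8859-1.
quotes = decode_candidates(b"\x93", ["iso-8859-1", "cp1252"])
```

Since decoding never fails, a detector has to rely on statistics about which characters are plausible together rather than on validity checks.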
comment:13 by , 10 months ago
Version: | SVN trunk → git master |
---|
Re Unicode Normalisation:
I think the workable approach is to provide an "opinionated" implementation where we pick one normalisation and only support that (we essentially do that for encodings - Xapian features which care about an encoding only support UTF-8).
A composed form is probably the more sensible choice here:
- Snowball stemmers all support that and few (maybe none) support decomposed forms
- It makes for smaller terms
- It seems by far the dominant form that data is actually in
That means NFC or NFKC - the latter seems helpful in some cases (e.g. ligatures: "oﬃce" -> "office") but less so in others (e.g. "4²" -> "42").
I think this needs a deeper analysis, but possibly we could define a subset of the Unicode compatibility equivalent forms to use here.
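To illustrate the trade-off, the Python sketch below compares NFC and NFKC on the two examples and then tries one possible way of taking only a subset of the compatibility mappings: keying on the decomposition tag recorded in the Unicode Character Database. The choice of tags in `ALLOWED_TAGS` and the function `selective_nfkc` are assumptions made for this example, not a proposal that has been analysed.

```python
import unicodedata

ligature = "o\ufb03ce"   # "office" written with U+FB03 LATIN SMALL LIGATURE FFI
squared = "4\u00b2"      # "4" followed by U+00B2 SUPERSCRIPT TWO

nfc = [unicodedata.normalize("NFC", s) for s in (ligature, squared)]
nfkc = [unicodedata.normalize("NFKC", s) for s in (ligature, squared)]

# Hypothetical policy: fold "<compat>" mappings (which include the Latin
# ligatures) but leave others such as "<super>" untouched.
ALLOWED_TAGS = {"<compat>"}


def selective_nfkc(text: str) -> str:
    """Apply compatibility folding only to characters with an allowed tag."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomp = unicodedata.decomposition(ch)
        tag = decomp.split(" ", 1)[0] if decomp.startswith("<") else None
        out.append(unicodedata.normalize("NFKC", ch) if tag in ALLOWED_TAGS else ch)
    # Re-compose in case folding produced new combinable sequences.
    return unicodedata.normalize("NFC", "".join(out))
```

The `<compat>` tag covers far more than ligatures, so a real subset would probably need to be chosen per character or per block rather than per tag.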
Another area is word-splitting - currently we split rather simply by just considering certain characters to be "term characters" and allowing certain suffixes and "infixes". Unicode defines rules for identifying words, which we should probably use (probably with a few tweaks - for example, we want "C++" and "C#" and "AT&T" to be terms and the Unicode rules don't seem to count them as words):
http://www.unicode.org/reports/tr29/
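As a toy illustration of the "term characters plus suffixes and infixes" idea, here is a Python sketch. The specific sets used (suffixes "+" and "#", infix "&") are assumptions picked to cover the three examples above and are not Xapian's actual rules; nor does this implement the UAX #29 algorithm.

```python
import re

# A term is a run of word characters, optionally joined by "&" infixes,
# optionally followed by "+" or "#" suffix characters.
_TERM = re.compile(r"\w+(?:&\w+)*[+#]*")


def split_terms(text: str):
    """Split text into terms using the simplified rules above."""
    return _TERM.findall(text)
```

With these rules `split_terms("C++ and C# at AT&T")` keeps "C++", "C#" and "AT&T" intact, where a splitter that only looks at word characters would produce "C", "C", "AT" and "T".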