Opened 18 years ago

Last modified 2 weeks ago

#150 assigned enhancement

Enhancements to Unicode support

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: QueryParser
Version: git master
Severity: minor
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

This bug is intended to just gather together enhancements we'd like to make to our Unicode support.

Currently I'm aware of:

  • Unicode has rules for identifying word boundaries, which we should investigate and perhaps use more of. For example, we currently handle a space followed by a non-spacing mark wrongly.
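For reference, a non-spacing mark is anything with Unicode general category "Mn"; a quick check with Python's unicodedata module (used here purely for illustration):

```python
import unicodedata

# A non-spacing mark has Unicode general category "Mn".  U+0301
# COMBINING ACUTE ACCENT following a space attaches to the space,
# which a simple "term character" classification handles badly.
mark = "\u0301"
assert unicodedata.category(mark) == "Mn"
print(unicodedata.name(mark))
```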

I'd imagine we would probably want to target most such changes at a ".0" release, for reasons of database compatibility. There are probably cases where it would be reasonable to implement such changes sooner though - if we build a different database in a case where the existing behaviour is poor, or the difference isn't problematic for some other reason, say.

Change History (13)

comment:1 by Olly Betts, 18 years ago

Status: new → assigned

Another is word-splitting - currently we split rather simply by just considering certain characters to be "term characters" and allowing certain suffixes and "infixes". Unicode defines rules for identifying words, which we should probably use (probably with a few tweaks - for example, we want "C++" and "C#" and "AT&T" to be terms and the Unicode rules don't seem to count them as words):

http://www.unicode.org/reports/tr29/
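As a rough illustration of the kind of tweak involved, here is a sketch in Python - a deliberately naive splitter standing in for the real UAX#29 rules - that folds adjacent '+', '#' and '&' back into the neighbouring word, so "C++", "C#" and "AT&T" survive as single terms. None of this is Xapian's actual tokeniser; the names and the regex are illustrative only:

```python
import re

# Symbols we want to allow to extend a word, contrary to plain
# word-boundary segmentation (an illustrative choice, not Xapian's).
TRAILING = "+#&"

def tokenize(text):
    tokens = []
    last_end = None
    # Naive stand-in for a word-boundary pass: word runs plus the
    # symbols we care about, in source order, with positions kept so
    # we only merge pieces that actually touch.
    for m in re.finditer(r"\w+|[+#&]", text):
        piece = m.group()
        adjacent = (last_end == m.start())
        if tokens and adjacent and (piece in TRAILING
                                    or tokens[-1][-1] in TRAILING):
            # Fold "C"+"+"+"+" into "C++", "AT"+"&"+"T" into "AT&T".
            tokens[-1] += piece
        else:
            tokens.append(piece)
        last_end = m.end()
    return tokens
```

A separated "+" (as in "a + b") stays a token of its own, since the pieces are not adjacent.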

comment:2 by Olly Betts, 18 years ago

Blocking: 160 added
Operating System: All

This is mostly (if not all) 1.1.0 material, so set to block bug#160.

comment:3 by Olly Betts, 18 years ago

Two items from Utf8Support on the wiki:

Perhaps scriptindex should support converting text from other encodings to UTF-8? This could be implemented in a backward-compatible way in 1.0.x.

omindex assumes text files are UTF-8 (although the UTF-8 parsing falls back to ISO-8859-1 for invalid UTF-8 sequences and is used for both term and sample generation). But we could use "libmagic" to do "charset detection" (see also bug#114).
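A minimal sketch of the sort of conversion scriptindex could perform - Python used purely for illustration, and the function name and the idea that the encoding comes from the index script are assumptions, not existing scriptindex behaviour:

```python
def to_utf8(data: bytes, encoding: str = "iso-8859-1") -> str:
    # Decode from the declared source encoding.  Fall back to
    # ISO-8859-1, which cannot fail: every byte value maps to a
    # code point, mirroring omindex's lenient handling.
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return data.decode("iso-8859-1")
```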

comment:4 by Olly Betts, 18 years ago

Provide a way to specify the output encoding for Omega.

comment:6 by Richard Boulton, 17 years ago

Description: modified (diff)
Milestone: 1.1

comment:7 by Richard Boulton, 17 years ago

Blocking: 160 removed

comment:8 by Olly Betts, 16 years ago

Milestone: 1.1.0 → 2.0.0

Pushing back to milestone:2.0.0 though that might mean 1.3.0 development for a 1.4.0 release - really I'm just saying "not for 1.1.x or 1.2.x".

Update to comment:4 - omindex now checks for a BOM in text files.
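A BOM check amounts to sniffing the first few bytes of the file; a sketch of the idea (not omindex's actual code - UTF-32 BOMs, which share a prefix with the UTF-16 LE one, are omitted for brevity):

```python
# Byte-order marks and the encoding each implies.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(data: bytes):
    # Return (encoding, remaining bytes), or (None, data) if no BOM.
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc, data[len(bom):]
    return None, data
```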

comment:9 by Olly Betts, 16 years ago

Description: modified (diff)

comment:10 by Olly Betts, 16 years ago

Description: modified (diff)

comment:11 by Dirk-Jan C. Binnema, 14 years ago

FYI, I'm using Xapian, and I 'flatten' (normalize) strings before adding them as terms; my table-based implementation:

http://gitorious.org/mu/mu-ng/blobs/master/src/mu-str-normalize.c

It's sufficient for most Latin-based accented characters, and the strong point (for speed/memory usage) is that it can flatten the strings _in place_.

For a more complete (and shorter) version, some equivalent of g_str_normalize could be used, where first the accented characters are decomposed into base characters and accents, and after that the accent characters are removed.

comment:12 by Olly Betts, 19 months ago

omindex assumes text files are UTF-8 (although the UTF-8 parsing falls back to ISO-8859-1 for invalid UTF-8 sequences and is used for both term and sample generation). But we could use "libmagic" to do "charset detection"

I had a quick look at doing so, but basically libmagic isn't actually useful for what we want - it seems to say either binary, us-ascii, iso-8859-1, utf-8 or unknown-8bit (the last for some files in cp-1252, the Microsoft embrace-and-extend superset of iso-8859-1). The binary files aren't text files, and the rest omindex should already handle correctly because it falls back to treating invalid UTF-8 text as cp-1252.

To be useful here we need something which can actually detect non-Unicode encodings, and ideally also which iso8859-N is in use.
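The fallback omindex already has boils down to a validity test: anything that decodes as UTF-8 is treated as UTF-8, and anything that doesn't is assumed to be cp-1252. A sketch of that heuristic (the function name is hypothetical, not an omindex API):

```python
def guess_text_encoding(data: bytes) -> str:
    # Text that is valid UTF-8 is almost certainly UTF-8: random 8-bit
    # text rarely satisfies the continuation-byte structure by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Otherwise assume cp1252, a superset of ISO-8859-1 which also
        # assigns printable characters to the 0x80-0x9F range.
        return "cp1252"
```

What this can't do - and what a useful detector would need to - is distinguish the various iso8859-N variants from one another.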

comment:13 by Olly Betts, 11 months ago

Version: SVN trunk → git master

Re Unicode Normalisation:

I think the workable approach is to provide an "opinionated" implementation where we pick one normalisation and only support that (we essentially do that for encodings - Xapian features which care about an encoding only support UTF-8).

A composed form is probably the more sensible choice here:

  • Snowball stemmers all support that and few (maybe none) support decomposed forms
  • It makes for smaller terms
  • It seems by far the dominant form that data is actually in

That means NFC or NFKC - the latter seems helpful in some cases (e.g. ligatures: "oﬃce" -> "office") but less so in others (e.g. "4²" -> "42").

I think this needs a deeper analysis, but possibly we could define a subset of the Unicode compatibility equivalent forms to use here.
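The difference between the two forms is easy to see with Python's unicodedata module (used here just to illustrate; it isn't part of Xapian):

```python
import unicodedata

# NFKC applies compatibility decompositions before recomposing, so
# ligatures and superscripts are folded; NFC leaves them alone.
assert unicodedata.normalize("NFC", "o\ufb03ce") == "o\ufb03ce"  # "ffi" ligature kept
assert unicodedata.normalize("NFKC", "o\ufb03ce") == "office"    # ligature expanded
assert unicodedata.normalize("NFC", "4\u00b2") == "4\u00b2"      # "4²" kept
assert unicodedata.normalize("NFKC", "4\u00b2") == "42"          # superscript lost
```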

comment:14 by Olly Betts, 2 weeks ago

I think this needs a deeper analysis, but possibly we could define a subset of the Unicode compatibility equivalent forms to use here.

Thinking about this more, defining our own subset is unhelpful - user code can use existing libraries (or language support) to convert to NFC or NFKC but to get "Xapian-NFC" they'd need to write their own conversion code, or (more sensibly) we'd need to provide conversion functionality. If we're going to have to provide it, it seems better to just convert the cases we want converted internally - so we might pick NFC and then internally handle cases such as the "ffi" ligature.
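A sketch of that hybrid: NFC for canonical equivalence, plus an explicit table for just the compatibility mappings we decide we want. The ligature subset here is purely an example, not anything Xapian has decided on, and the function name is hypothetical:

```python
import unicodedata

# Hand-picked compatibility foldings: just the Latin f-ligatures,
# as an illustrative subset.
LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}
_TABLE = str.maketrans(LIGATURES)

def xapian_nfc(text: str) -> str:
    # Canonical composition first, then the selected compatibility
    # cases; "4²" and similar are deliberately left untouched.
    return unicodedata.normalize("NFC", text).translate(_TABLE)
```

With this, "oﬃce" folds to "office" while "4²" stays "4²", and combining sequences like e + U+0301 still compose to "é".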

Note: See TracTickets for help on using tickets.