Opened 12 years ago

Last modified 2 months ago

#609 assigned enhancement

term generation for some French elisions produces imperfect results.

Reported by: Paul Rudin
Owned by: Olly Betts
Priority: highest
Milestone: 1.5.0
Component: QueryParser
Version: git master
Severity: normal
Keywords:
Cc: Kelson
Blocked By:
Blocking:
Operating System: All

Description

Using xapian.TermGenerator with the standard French stemmer, text containing, for example, "l'Etat" gives the terms "l'etat" and "Zl'etat". The problem is that a subsequent search for "etat" won't match, although in most cases a match is probably what users want.

I suppose that the correct thing would be to stem to etat?
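To make the mismatch concrete, here is a minimal pure-Python sketch. It does not call Xapian or Snowball: `toy_terms` is a hypothetical stand-in for a tokenizer that keeps the ASCII apostrophe inside words, and `ELIDED_PREFIXES` is an assumed, incomplete list chosen only for illustration.

```python
import re

# Assumed list of elided French prefixes, for illustration only; it is not
# taken from Xapian or Snowball and is certainly incomplete.
ELIDED_PREFIXES = ("l", "d", "j", "m", "t", "s", "n", "c", "qu")

# Toy tokenizer (hypothetical): runs of letters, optionally joined by ASCII
# apostrophes, so "l'Etat" stays a single word.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def toy_terms(text):
    """Return the lowercased words in text, keeping apostrophes inside words."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def strip_elision(term):
    """Drop a leading elided prefix such as "l'" or "d'" if one is present."""
    head, sep, tail = term.partition("'")
    if sep and tail and head in ELIDED_PREFIXES:
        return tail
    return term


terms = toy_terms("l'Etat et d'autres")
stripped = [strip_elision(t) for t in terms]
print(terms)     # the term "etat" is absent, so a search for it finds nothing
print(stripped)  # elided prefixes removed before any stemming would happen
```

Words such as "aujourd'hui" pass through unchanged because "aujourd" is not in the prefix list. Whichever component ends up doing this, the same step would have to run on both the indexing and the query side.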

Change History (7)

comment:1 by Olly Betts, 12 years ago

Component: Other → QueryParser
Version: SVN trunk

I guess we need to decide if it is the TermGenerator's job to handle the apostrophe in cases like this, or the stemmer's job to cope with the apostrophe appropriately.

Currently TermGenerator treats apostrophe as a word character, and the English stemmer understands "'s" suffixes, but I don't think any other stemmers do anything special with apostrophes.

And QueryParser needs to match TermGenerator in this regard.

comment:2 by Olly Betts, 5 years ago

Milestone: 1.5.0
Version: SVN trunk → git master

comment:3 by Kelson, 4 years ago

Probably kind of obvious, but this causes a problem not only for "l'" but also for "d'", which is really common too. See this Kiwix ticket https://github.com/openzim/libzim/issues/592 for another concrete example of the problem.

comment:4 by Kelson, 4 years ago

Cc: Kelson added

comment:5 by Olly Betts, 16 months ago

I think ideally we'd deal with this in Snowball so I've opened an issue there: https://github.com/snowballstem/snowball/issues/187

comment:6 by Olly Betts, 16 months ago

Priority: normal → highest

comment:7 by Olly Betts, 2 months ago

Status: new → assigned

I've pushed a change to Snowball to implement this: https://github.com/snowballstem/snowball/commit/664b9893ee16f4d5aa63f9898046f832976f98c4

So far only the ASCII apostrophe (') is handled. We ought to handle Unicode apostrophes too, but that's a bit more fiddly because they're not in ISO-8859-1, which Snowball upstream still supports. For Xapian we only build UTF-8 stemmers, so we could patch this in for now if necessary.
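As a rough illustration of the encoding point, the Python sketch below assumes that "Unicode apostrophe" means U+2019 RIGHT SINGLE QUOTATION MARK, the usual typographic apostrophe. The helper names are hypothetical and this is not the Snowball or Xapian change itself, only one possible pre-processing step for UTF-8 text.

```python
# Assumption: the typographic apostrophe in question is U+2019.
TYPOGRAPHIC_APOSTROPHE = "\u2019"


def normalise_apostrophes(text):
    """Map the typographic apostrophe to the ASCII one before stemming."""
    return text.replace(TYPOGRAPHIC_APOSTROPHE, "'")


def encodable_as_latin1(ch):
    """Report whether ch has a representation in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


print(normalise_apostrophes("l\u2019Etat"))
print(encodable_as_latin1("'"), encodable_as_latin1(TYPOGRAPHIC_APOSTROPHE))
```

Normalising before the stemmer runs would let an ASCII-only elision rule cover both spellings, at the cost of an extra pass over the text.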
