#609 closed enhancement (fixed)
term generation for some French elisions produces imperfect results.
| Reported by: | Paul Rudin | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | highest | Milestone: | 2.0.0 |
| Component: | QueryParser | Version: | git master |
| Severity: | normal | Keywords: | |
| Cc: | Kelson | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
Using xapian.TermGenerator with the standard French stemmer, text containing, for example, "l'Etat" gives the terms "l'etat" and "Zl'etat". The problem is that a search for "etat" then won't match, although in most cases a match is probably what users want.
I suppose that the correct thing would be to stem to etat?
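To make the expected behaviour concrete, here is a minimal sketch of stripping a leading elision from a token before stemming. It is plain Python and is not Xapian's or Snowball's implementation; the name `strip_elision` and the list of elided prefixes are assumptions chosen for illustration.

```python
import re

# Hypothetical prefix list (l', d', qu', ...); an assumption for this sketch,
# not the list used by Snowball's French stemmer.
ELISION_RE = re.compile(r"^(?:qu|[cdjlmnst])'(?=\w)", re.IGNORECASE)


def strip_elision(token: str) -> str:
    """Drop a leading elided article or pronoun such as l' or d'."""
    return ELISION_RE.sub("", token)


for word in ["l'Etat", "d'accord", "qu'il", "aujourd'hui", "etat"]:
    print(word, "->", strip_elision(word).lower())
```

Only a prefix at the start of the token is removed, so a word such as "aujourd'hui", where the apostrophe is word-internal, is left alone.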
Change History (12)
comment:1 by , 13 years ago
| Component: | Other → QueryParser |
|---|---|
| Version: | → SVN trunk |
comment:2 by , 6 years ago
| Milestone: | → 1.5.0 |
|---|---|
| Version: | SVN trunk → git master |
comment:3 by , 4 years ago
Probably kind of obvious, but this causes a problem not only for "l'" but also for "d'", which is really common too. See this Kiwix ticket https://github.com/openzim/libzim/issues/592 for another concrete example of the problem.
comment:4 by , 4 years ago
| Cc: | added |
|---|
comment:5 by , 2 years ago
I think ideally we'd deal with this in Snowball so I've opened an issue there: https://github.com/snowballstem/snowball/issues/187
comment:6 by , 2 years ago
| Priority: | normal → highest |
|---|
comment:7 by , 11 months ago
| Status: | new → assigned |
|---|
I've pushed a change to Snowball to implement this: https://github.com/snowballstem/snowball/commit/664b9893ee16f4d5aa63f9898046f832976f98c4
So far only the ASCII apostrophe (') is handled. We ought to handle Unicode apostrophes too, but that's a bit more fiddly because they're not in ISO-8859-1, which Snowball upstream still supports. For Xapian we only build UTF-8 stemmers, so we could patch this in for now if necessary.
comment:9 by , 7 days ago
6b6254d397b1088a17fa82648f3cf44c17f97cd1 merges the latest Snowball algorithm versions, and with that we now handle French elisions which use the ASCII apostrophe character.
I think for now we should probably just patch in the Unicode apostrophe characters in Xapian's copy of french.sbl (once that's addressed in upstream Snowball our patch can be dropped).
comment:10 by , 7 days ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
Actually we already normalise U+2019 and U+201B in the QueryParser and TermGenerator, so we don't need to do anything special in the stemmer for them:
$ examples/quest -s french 'l’etat'
Parsed Query: Query(Zetat@1)
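The normalisation mentioned above can be pictured with a small sketch. This is plain Python rather than Xapian's actual code, and `normalise_apostrophes` is a hypothetical name; it only maps the two code points named in this comment (U+2019 and U+201B) to the ASCII apostrophe.

```python
# Map the two typographic apostrophes named above to the ASCII apostrophe.
APOSTROPHE_MAP = {0x2019: "'", 0x201B: "'"}


def normalise_apostrophes(text: str) -> str:
    """Return text with U+2019 and U+201B replaced by U+0027."""
    return text.translate(APOSTROPHE_MAP)


print(normalise_apostrophes("l\u2019etat"))
```

Once the input is normalised this way, a stemmer that only knows about the ASCII apostrophe sees the same string for both spellings.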
follow-up: 12 comment:11 by , 6 days ago
@Olly Thank you very much for this long-awaited bug fix. It looks like Xapian 2.0 is almost done, and at Kiwix we should try to compile it so that we're ready to use it once it has been released!
comment:12 by , 3 days ago
Replying to Kelson:
> @Olly Thank you very much for this long-awaited bug fix. It looks like Xapian 2.0 is almost done, and at Kiwix we should try to compile it so that we're ready to use it once it has been released!
Yes, we're getting pretty close to 2.0 now.
I'd definitely encourage people to start trying it with existing code. The intention is it's compatible (aside from things which were deprecated in 1.4.0 or before - these have now been removed), but reality doesn't always fall in line and it's better to catch problems before the release as we're then restricted by not wanting to break the ABI.

I guess we need to decide if it is the TermGenerator's job to handle the apostrophe in cases like this, or the stemmer's job to cope with the apostrophe appropriately.
Currently TermGenerator treats apostrophe as a word character, and the English stemmer understands "'s" suffixes, but I don't think any other stemmers do anything special with apostrophes.
And QueryParser needs to match TermGenerator in this regard.
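As a rough illustration of why the two sides must agree, here is a toy tokenizer that keeps an apostrophe between word characters inside the token. It is a simplified model written for this ticket, not Xapian's TermGenerator or QueryParser code, and `tokens` is a hypothetical helper.

```python
import re

# Toy rule: an apostrophe between word characters stays inside the token.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def tokens(text: str) -> list[str]:
    """Split text into lower-cased tokens using the toy rule above."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


indexed = set(tokens("l'Etat c'est moi"))
print(sorted(indexed))
# The query term "etat" is not among the indexed terms.
print(set(tokens("etat")) <= indexed)
```

With this rule the document yields "l'etat" as a single term, so a query for "etat" finds nothing unless the stemmer (or the tokenizer on both sides) removes the elision.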