Opened 13 years ago

Closed 7 days ago

Last modified 3 days ago

#609 closed enhancement (fixed)

term generation for some French elisions produces imperfect results.

Reported by: Paul Rudin
Owned by: Olly Betts
Priority: highest
Milestone: 2.0.0
Component: QueryParser
Version: git master
Severity: normal
Keywords:
Cc: Kelson
Blocked By:
Blocking:
Operating System: All

Description

When xapian.TermGenerator is used with the standard French stemmer, text containing, for example, "l'Etat" gives the terms "l'etat" and "Zl'etat". The problem is that a subsequent search for "etat" then won't match, although in most cases a match is probably what users want.

I suppose that the correct thing would be to stem to etat?

Change History (12)

comment:1 by Olly Betts, 13 years ago

Component: Other → QueryParser
Version: SVN trunk

I guess we need to decide if it is the TermGenerator's job to handle the apostrophe in cases like this, or the stemmer's job to cope with the apostrophe appropriately.

Currently TermGenerator treats apostrophe as a word character, and the English stemmer understands "'s" suffixes, but I don't think any other stemmers do anything special with apostrophes.

And QueryParser needs to match TermGenerator in this regard.
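
To illustrate why both sides have to agree, here is a minimal sketch of an elision-stripping step applied identically to indexed words and to query words. It is not Xapian's or Snowball's implementation: the helper names `strip_elision` and `normalise` are hypothetical, and the prefix list is an assumption chosen for illustration only.

```python
import re

# Hypothetical list of elided prefixes; an assumption for illustration,
# not the set used by Snowball or Xapian.
ELIDED_PREFIXES = ("qu", "c", "d", "j", "l", "m", "n", "s", "t")

# Match one of the prefixes plus an ASCII apostrophe at the start of a word,
# but only when a word character follows.
_ELISION_RE = re.compile(
    r"^(?:%s)'(?=\w)" % "|".join(ELIDED_PREFIXES), re.IGNORECASE
)


def strip_elision(word: str) -> str:
    """Remove a leading elided article or pronoun such as l' or d'."""
    return _ELISION_RE.sub("", word)


def normalise(word: str) -> str:
    """Lower-case the word and strip any elision; stemming would follow."""
    return strip_elision(word.lower())


# The same normalise() is used for indexing and for the query, so a
# search for "etat" can match text that contained "l'Etat".
indexed = {normalise(w) for w in "l'Etat d'accord aujourd'hui".split()}
print(normalise("etat") in indexed)
```

If only one side stripped the elision, the indexed term and the query term would differ and the search would miss, which is the behaviour reported in this ticket.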

comment:2 by Olly Betts, 6 years ago

Milestone: 1.5.0
Version: SVN trunk → git master

comment:3 by Kelson, 4 years ago

Probably kind of obvious, but this causes a problem not only for "l'" but also for "d'", which is really common too. See this Kiwix ticket https://github.com/openzim/libzim/issues/592 for another concrete example of the problem.

comment:4 by Kelson, 4 years ago

Cc: Kelson added

comment:5 by Olly Betts, 2 years ago

I think ideally we'd deal with this in Snowball so I've opened an issue there: https://github.com/snowballstem/snowball/issues/187

comment:6 by Olly Betts, 2 years ago

Priority: normal → highest

comment:7 by Olly Betts, 11 months ago

Status: new → assigned

I've pushed a change to Snowball to implement this: https://github.com/snowballstem/snowball/commit/664b9893ee16f4d5aa63f9898046f832976f98c4

So far only the ASCII apostrophe (') is handled. We ought to handle Unicode apostrophes too, but that's a bit more fiddly because they're not in ISO-8859-1, which Snowball upstream still supports. For Xapian we only build UTF-8 stemmers, so we could patch this in for now if necessary.

comment:8 by Olly Betts, 3 weeks ago

Milestone: 1.5.0 → 2.0.0

Milestone renamed

comment:9 by Olly Betts, 7 days ago

6b6254d397b1088a17fa82648f3cf44c17f97cd1 merges the latest Snowball algorithm versions, and with that we now handle French elisions which use the ASCII apostrophe character.

I think for now we should probably just patch in the Unicode apostrophe characters in Xapian's copy of french.sbl (once that's addressed in upstream Snowball our patch can be dropped).

comment:10 by Olly Betts, 7 days ago

Resolution: fixed
Status: assigned → closed

Actually we already normalise U+2019 and U+201B in the QueryParser and TermGenerator, so we don't need to do anything special in the stemmer for them:

$ examples/quest -s french 'l’etat'
Parsed Query: Query(Zetat@1)
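
The normalisation step mentioned above can be pictured with a short sketch that maps the two named code points to the ASCII apostrophe before any stemming happens. This only illustrates the idea and is not Xapian's code; the helper name `normalise_apostrophes` is hypothetical.

```python
# Map typographic apostrophes to the ASCII apostrophe.
# U+2019 and U+201B are the two code points named in this comment;
# the table and helper name are assumptions for illustration.
_APOSTROPHES = {0x2019: "'", 0x201B: "'"}


def normalise_apostrophes(text: str) -> str:
    """Replace U+2019 and U+201B with the ASCII apostrophe."""
    return text.translate(_APOSTROPHES)


# After normalisation the typographic form is identical to the ASCII form,
# so a stemmer that only knows the ASCII apostrophe sees the same input.
print(normalise_apostrophes("l\u2019etat") == "l'etat")
```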

comment:11 by Kelson, 6 days ago

@Olly Thank you very much for this long-awaited bug fix. It looks like Xapian 2.0 is almost done, and at Kiwix we should try to compile it so that we are ready to use it once it has been released!

comment:12 by Olly Betts, 3 days ago (in reply to: comment:11)

Replying to Kelson:

@Olly Thank you very much for this long-awaited bug fix. It looks like Xapian 2.0 is almost done, and at Kiwix we should try to compile it so that we are ready to use it once it has been released!

Yes, we're getting pretty close to 2.0 now.

I'd definitely encourage people to start trying it with existing code. The intention is that it's compatible (aside from things which were deprecated in 1.4.0 or before; these have now been removed), but reality doesn't always fall in line, and it's better to catch problems before the release, since afterwards we're restricted by not wanting to break the ABI.
