Opened 14 years ago

Closed 13 years ago

#507 closed defect (notabug)

Some small problems with the French stemmer

Reported by: Versmisse David
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Library API
Version: 1.2.3
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Hello,

Here is a short list of problems we found with the French stemmer:

For nouns ending in "...e", the final e is removed. For example: poule (chicken) => poul (should be poule). This is fine for an adjective, but not for a noun.

The same happens with nouns ending in "...lle" or "...tte". For example: brouette (wheelbarrow) => brouet (should be brouette).

I understand that this is a hard problem, because the same rule cannot be applied to both nouns and adjectives. With the current behaviour (Xapian 1.2.3), a search returns too many results, i.e. "false positives", although that is better than "false negatives".

Do you have a test file for the stemmer? We can help you extend it.

Best regards,

Versmisse.

Change History (2)

comment:1 by Olly Betts, 14 years ago

Component: Other → Library API
Version: 1.2.3

I think what you're describing is a feature rather than a bug.

The stems which are produced aren't necessarily actual words, but rather tokens which look rather like the words associated with that stem.

For example, in English "early" stems to "earli", which isn't a real word. But this doesn't matter, as what is important is that "earlier" also stems to "earli".
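To make this concrete, here is a minimal Python sketch of why the appearance of a stem doesn't matter. Note that toy_stem is a hypothetical two-rule suffix stripper written for this illustration; it is not the Snowball algorithm and not Xapian's API. The only point is that documents and queries pass through the same function, so matching depends on the stems being equal, not on their being real words.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stemmer for illustration only (NOT the Snowball algorithm)."""
    word = word.lower()
    # Rule 1: drop a trailing "er" from longer words ("earlier" -> "earli").
    if len(word) > 4 and word.endswith("er"):
        word = word[:-2]
    # Rule 2: turn a trailing "y" into "i" ("early" -> "earli").
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Stem the query term the same way as the indexed text, then look it up."""
    return index.get(toy_stem(query), set())


docs = {1: "an early train", 2: "the earlier train", 3: "a late train"}
index = build_index(docs)

# The user only ever sees "early" and the matching documents, never "earli".
print(sorted(search(index, "early")))
```

A search for "early" finds both the document containing "early" and the one containing "earlier", even though the shared key "earli" is not an English word.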

Section 5 of http://snowball.tartarus.org/texts/introduction.html discusses this:

A question arises: if the user never sees the stemmed form, does its appearance matter? The answer must be no, although the Porter stemmer tries to make the unstemmed forms guessable from the stemmed forms. For example, from appropri you can guess appropriate. At least, trying to achieve this effect acts as a useful control. Similarly with the other stemmers presented here, an attempt has been made to keep the appearance of the stemmed forms as familiar as possible.

If the stemmer is producing the same stem for words which should have different stems (or different stems for cases which should be the same) then it would be more efficient to report this directly to the Snowball developers. Snowball is the project which maintains these algorithms - see http://snowball.tartarus.org/
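One way to collect such cases before reporting them is a small checker like the sketch below. Everything in it is an assumption made for illustration: check_conflation and naive_stem are hypothetical names, and naive_stem is a deliberately crude stand-in, so its output says nothing about what the real French stemmer does. In practice you would pass in the real stemming function instead.

```python
def check_conflation(stem, same_groups, different_pairs):
    """Return a list of problems found for the given stem function.

    same_groups: lists of words that should all share one stem.
    different_pairs: pairs of words that should NOT share a stem.
    """
    problems = []
    for group in same_groups:
        stems = {stem(w) for w in group}
        if len(stems) > 1:
            # Different stems for words which should be the same.
            problems.append(("under-stemmed", tuple(group), tuple(sorted(stems))))
    for a, b in different_pairs:
        if stem(a) == stem(b):
            # Same stem for words which should be different.
            problems.append(("over-stemmed", (a, b), (stem(a),)))
    return problems


def naive_stem(word):
    """Hypothetical stand-in stemmer: strips one of "es", "e", "s"."""
    for suffix in ("es", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


problems = check_conflation(
    naive_stem,
    same_groups=[["poule", "poules"]],   # singular and plural should match
    different_pairs=[("port", "porte")],  # harbour vs. door should not
)
for kind, words, stems in problems:
    print(kind, words, stems)
```

With this stand-in, "poule" and "poules" both give "poul", which is fine even though "poul" is not a word, while "port" and "porte" collide. Only the second kind of result is worth reporting.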

There's test data for the stemmers in SVN under browser:trunk/xapian-data

comment:2 by Olly Betts, 13 years ago

Resolution: notabug
Status: new → closed
Type: enhancement → defect

Closing as "notabug" - as I explained in the previous comment, I think this is working as intended.
