Changes between Initial Version and Version 1 of FAQ/UnstemmingTerms


Ignore:
Timestamp:
19/01/09 06:26:21 (16 years ago)
Author:
Olly Betts
Comment:

new FAQ - unstemming

Legend:

Unmodified
Added
Removed
Modified
  • FAQ/UnstemmingTerms

    v1 v1  
     1= How can I reverse the actions of Xapian::Stem to produce a list of words with a given stem? =
     2
     3The !QueryParser class has {{{unstem_begin(TERM)}}} and {{{unstem_end(TERM)}}}
     4methods which allow iteration over any words in the query string which stemmed
     5to a given term.  This is easy to implement by simply recording what words
     6led to each stemmed term while parsing the query string.
     7
     8There isn't currently a Xapian API feature to produce such a list in other cases.
     9
     10If you wanted to add such a feature, there are several approaches you could take
     11(this list may not be exhaustive):
     12
     13 * Write a set of unstemming algorithms corresponding to each of the stemming
     14 algorithms which returns a list of words which stem to a given stem.
     15   
     16 For an arbitrary stemming algorithm, this isn't actually possible -
     17 consider one which removes as many trailing letter "s" as it can
     18 (that's {{{s/s+$//}}} as a Perl regexp) which would give an infinite
     19 list of possible "unstems".  Fortunately this problem
     20 doesn't seem to affect any of the existing stemming algorithms, and
     21 it seems like an indication of a poorly designed algorithm.
     22
     23 The main practical downside of this approach is it requires writing
     24 rather a lot of code and the potential for the behaviour of stemmer
     25 and unstemmer not to match.
     26
     27 Or you could try to write something
     28 which takes a Snowball algorithm and produces the corresponding inverse
     29 algorithm, but that's probably a lot harder.
     30
     31 * Generate an overly generous list of unstems and cull them by
     32 stemming and checking if they match the stemmed word passed in.  This
     33 is essentially a simplified way of implementing the first approach,
     34 but avoids any possibility of false positives (it could still fail to
     35 report a valid unstem though.  Provide the candidate list isn't too
     36 long this should perform well.
     37
     38 * You could stem all unstemmed terms in the database and
     39 report those which match the given stemmed word.  This doesn't
     40 produce all unstems - only those actually used (perhaps that's actually
     41 more useful though!)
     42
     43 The main problem is it's likely to be rather slow.  If you
     44 know (by inspecting the algorithm for a particular language) that a word
     45 and its stem must share the same first ''N'' letters, then that would help
     46 speed this up as you'd only need to check a subset of the terms in the
     47 database.  ''N'' is at least 1 for the English stemmer.
     48
     49 * Xapian databases could automatically maintain a stem->unstem mapping
     50 for terms in the database.  The main problem with this is where to
     51 handle it - the backends don't currently care about stemming, but the
     52 !TermGenerator class doesn't know when documents are deleted.
     53
     54[wiki:FAQ FAQ Index]