wiki:FAQ/UnstemmingTerms

How can I reverse the actions of Xapian::Stem to produce a list of words with a given stem?

The QueryParser class has unstem_begin(TERM) and unstem_end(TERM) methods which allow iteration over any words in the query string which stemmed to a given term. This is easy to implement by simply recording what words led to each stemmed term while parsing the query string.

There isn't currently a Xapian API feature to produce such a list in other cases.

If you wanted to add such a feature, there are several approaches you could take (this list may not be exhaustive):

  • Write a set of unstemming algorithms corresponding to each of the stemming algorithms which returns a list of words which stem to a given stem.

For an arbitrary stemming algorithm, this isn't actually possible - consider one which removes as many trailing letter "s" as it can (that's s/s+$// as a Perl regexp) which would give an infinite list of possible "unstems". Fortunately this problem doesn't seem to affect any of the existing stemming algorithms, and it seems like an indication of a poorly designed algorithm.

The main practical downside of this approach is it requires writing rather a lot of code and the potential for the behaviour of stemmer and unstemmer not to match.

Or you could try to write something which takes a Snowball algorithm and produces the corresponding inverse algorithm, but that's probably a lot harder.

  • Generate an overly generous list of unstems and cull them by stemming and checking if they match the stemmed word passed in. This is essentially a simplified way of implementing the first approach, but avoids any possibility of false positives (it could still fail to report a valid unstem though. Provide the candidate list isn't too long this should perform well.
  • You could stem all unstemmed terms in the database and report those which match the given stemmed word. This doesn't produce all unstems - only those actually used (perhaps that's actually more useful though!)

The main problem is it's likely to be rather slow. If you know (by inspecting the algorithm for a particular language) that a word and its stem must share the same first N letters, then that would help speed this up as you'd only need to check a subset of the terms in the database. N is at least 1 for the English stemmer.

  • Xapian databases could automatically maintain a stem->unstem mapping for terms in the database. The main problem with this is where to handle it - the backends don't currently care about stemming, but the TermGenerator class doesn't know when documents are deleted.

FAQ Index

Last modified 10 years ago Last modified on 19/01/09 06:26:21