Changes between Initial Version and Version 1 of FAQ/UnstemmingTerms

Timestamp: 19/01/09 06:26:21 (16 years ago)
Author: Olly Betts
Comment: new FAQ - unstemming

FAQ/UnstemmingTerms

               v1
+= How can I reverse the actions of Xapian::Stem to produce a list of words with a given stem? =
+The !QueryParser class has {{{unstem_begin(TERM)}}} and {{{unstem_end(TERM)}}}
+methods which allow iteration over any words in the query string which stemmed
+to a given term.  This is easy to implement by simply recording what words
+led to each stemmed term while parsing the query string.
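
As a rough illustration of that recording idea (a sketch only, not Xapian's actual implementation), the Python below keeps a map from each stemmed term to the words which produced it. The names {{{toy_stem}}} and {{{parse_query}}} are hypothetical, and {{{toy_stem}}} is a crude suffix stripper standing in for a real Xapian::Stem algorithm.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse_query(query_string, stem=toy_stem):
    """Split a query into stemmed terms, recording which words led to each.

    Returns (terms, unstem) where unstem maps a stemmed term to the list
    of distinct query words that stemmed to it, in order of appearance.
    """
    terms = []
    unstem = defaultdict(list)
    for word in query_string.lower().split():
        term = stem(word)
        terms.append(term)
        if word not in unstem[term]:
            unstem[term].append(word)
    return terms, dict(unstem)


terms, unstem = parse_query("Walking walked the walks")
print(unstem["walk"])  # ['walking', 'walked', 'walks']
```

The map only ever contains words that appeared in the query string, which is why this approach doesn't help in the general case.
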
+There isn't currently a Xapian API feature to produce such a list in other cases.
+If you wanted to add such a feature, there are several approaches you could take
+(this list may not be exhaustive):
+ * Write an unstemming algorithm corresponding to each of the stemming
+ algorithms, which returns a list of the words which stem to a given stem.
+ For an arbitrary stemming algorithm, this isn't actually possible -
+ consider one which removes as many trailing letter "s" as it can
+ (that's {{{s/s+$//}}} as a Perl regexp) which would give an infinite
+ list of possible "unstems".  Fortunately this problem
+ doesn't seem to affect any of the existing stemming algorithms, and
+ it seems like an indication of a poorly designed algorithm.
+ The main practical downsides of this approach are that it requires writing
+ rather a lot of code, and that the behaviour of the stemmer
+ and unstemmer might not match.
+ Alternatively, you could try to write something
+ which takes a Snowball algorithm and produces the corresponding inverse
+ algorithm, but that's probably a lot harder.
+ * Generate an overly generous list of unstems and cull them by
+ stemming and checking if they match the stemmed word passed in.  This
+ is essentially a simplified way of implementing the first approach,
+ but avoids any possibility of false positives (it could still fail to
+ report a valid unstem though).  Provided the candidate list isn't too
+ long, this should perform well.
+ * You could stem all unstemmed terms in the database and
+ report those which match the given stemmed word.  This doesn't
+ produce all unstems - only those actually used (perhaps that's actually
+ more useful though!).
+ The main problem is that it's likely to be rather slow.  If you
+ know (by inspecting the algorithm for a particular language) that a word
+ and its stem must share the same first ''N'' letters, then that would help
+ speed this up as you'd only need to check a subset of the terms in the
+ database.  ''N'' is at least 1 for the English stemmer.
+ * Xapian databases could automatically maintain a stem->unstem mapping
+ for terms in the database.  The main problem with this is where to
+ handle it - the backends don't currently care about stemming, but the
+ !TermGenerator class doesn't know when documents are deleted.
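
The "generate and cull" and "scan the database terms" approaches above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: {{{toy_stem}}} is a made-up suffix stripper standing in for a real stemmer, the candidate suffix list is invented for illustration, and a plain Python list stands in for the unstemmed terms of a database (with a real database these would presumably be read via something like {{{Xapian::Database::allterms_begin()}}}). The function names are hypothetical too.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Illustrative candidate endings; a real list would depend on the language.
CANDIDATE_SUFFIXES = ("", "s", "ed", "ing")


def unstem_by_culling(stem_term, stem=toy_stem, suffixes=CANDIDATE_SUFFIXES):
    """Generate generous candidates, keep only those which stem back."""
    candidates = (stem_term + suffix for suffix in suffixes)
    return [word for word in candidates if stem(word) == stem_term]


def unstem_by_scanning(stem_term, all_terms, stem=toy_stem, shared_prefix_len=1):
    """Stem every known term and report those matching stem_term.

    shared_prefix_len is the number of leading letters a word and its stem
    are assumed to share (the N in the text); terms which don't share that
    prefix are skipped without being stemmed.
    """
    prefix = stem_term[:shared_prefix_len]
    return sorted(
        term
        for term in all_terms
        if term.startswith(prefix) and stem(term) == stem_term
    )


print(unstem_by_culling("walk"))  # ['walk', 'walks', 'walked', 'walking']
print(unstem_by_culling("be"))    # ['be']

database_terms = ["walk", "walked", "walking", "wall", "walls", "talked"]
print(unstem_by_scanning("walk", database_terms))  # ['walk', 'walked', 'walking']
```

Culling can only return words the candidate generator thought of, so it may miss valid unstems, while scanning only returns words which actually occur in the term list.
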
+[wiki:FAQ FAQ Index]