| 1 | = How can I reverse the actions of Xapian::Stem to produce a list of words with a given stem? = |
| 2 | |
| 3 | The !QueryParser class has {{{unstem_begin(TERM)}}} and {{{unstem_end(TERM)}}} |
| 4 | methods which allow iteration over any words in the query string which stemmed |
| 5 | to a given term. This is easy to implement by simply recording what words |
| 6 | led to each stemmed term while parsing the query string. |
| 7 | |
| 8 | There isn't currently a Xapian API feature to produce such a list in other cases. |
| 9 | |
| 10 | If you wanted to add such a feature, there are several approaches you could take |
| 11 | (this list may not be exhaustive): |
| 12 | |
| 13 | * Write a set of unstemming algorithms corresponding to each of the stemming |
| 14 | algorithms which returns a list of words which stem to a given stem. |
| 15 | |
| 16 | For an arbitrary stemming algorithm, this isn't actually possible - |
| 17 | consider one which removes as many trailing letter "s" as it can |
| 18 | (that's {{{s/s+$//}}} as a Perl regexp) which would give an infinite |
| 19 | list of possible "unstems". Fortunately this problem |
| 20 | doesn't seem to affect any of the existing stemming algorithms, and |
| 21 | it seems like an indication of a poorly designed algorithm. |
| 22 | |
| 23 | The main practical downside of this approach is it requires writing |
| 24 | rather a lot of code and the potential for the behaviour of stemmer |
| 25 | and unstemmer not to match. |
| 26 | |
| 27 | Or you could try to write something |
| 28 | which takes a Snowball algorithm and produces the corresponding inverse |
| 29 | algorithm, but that's probably a lot harder. |
| 30 | |
| 31 | * Generate an overly generous list of unstems and cull them by |
| 32 | stemming and checking if they match the stemmed word passed in. This |
| 33 | is essentially a simplified way of implementing the first approach, |
| 34 | but avoids any possibility of false positives (it could still fail to |
| 35 | report a valid unstem though. Provide the candidate list isn't too |
| 36 | long this should perform well. |
| 37 | |
| 38 | * You could stem all unstemmed terms in the database and |
| 39 | report those which match the given stemmed word. This doesn't |
| 40 | produce all unstems - only those actually used (perhaps that's actually |
| 41 | more useful though!) |
| 42 | |
| 43 | The main problem is it's likely to be rather slow. If you |
| 44 | know (by inspecting the algorithm for a particular language) that a word |
| 45 | and its stem must share the same first ''N'' letters, then that would help |
| 46 | speed this up as you'd only need to check a subset of the terms in the |
| 47 | database. ''N'' is at least 1 for the English stemmer. |
| 48 | |
| 49 | * Xapian databases could automatically maintain a stem->unstem mapping |
| 50 | for terms in the database. The main problem with this is where to |
| 51 | handle it - the backends don't currently care about stemming, but the |
| 52 | !TermGenerator class doesn't know when documents are deleted. |
| 53 | |
| 54 | [wiki:FAQ FAQ Index] |