| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>Xapian: Stemming Algorithms</TITLE> |
|---|
| 5 | </HEAD> |
|---|
| 6 | <BODY BGCOLOR="white"> |
|---|
| 7 | |
|---|
| 8 | <center><h1> |
|---|
| 9 | Stemming Algorithms |
|---|
| 10 | </h1></center> |
|---|
| 11 | |
|---|
| 12 | Xapian uses the |
|---|
| 13 | <A HREF="http://snowball.tartarus.org/">Snowball Stemming Algorithms</A>. |
|---|
| 14 | At present, these support Danish, Dutch, English, Finnish, French, German, |
|---|
| 15 | Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, |
|---|
| 16 | and Turkish. |
|---|
| 17 | There are also implementations of Lovins' English stemmer, Porter's original |
|---|
| 18 | English stemmer, the Kraaij-Pohlmann Dutch stemmer, and a variation of the |
|---|
| 19 | German stemmer which normalises umlauts.<p> |
|---|
| 20 | |
|---|
| 21 | We'd like to add stemmers for more languages - see the Snowball site for |
|---|
| 22 | information on how to contribute.<p> |
|---|
| 23 | |
|---|
| 24 | <h2>What is a stemming algorithm?</h2> |
|---|
| 25 | |
|---|
| 26 | A stemming algorithm is a process of linguistic normalisation, in which the |
|---|
| 27 | variant forms of a word are reduced to a common form, for example, |
|---|
| 28 | <pre> |
|---|
| 29 | connection |
|---|
| 30 | connections |
|---|
| 31 | connective ---> connect |
|---|
| 32 | connected |
|---|
| 33 | connecting |
|---|
| 34 | </pre> |
|---|
| 35 | |
|---|
| 36 | It is important to appreciate that we use stemming with the intention of |
|---|
| 37 | improving the performance of IR systems. It is not an exercise in etymology or |
|---|
| 38 | grammar. In fact from an etymological or grammatical viewpoint, a stemming |
|---|
| 39 | algorithm is liable to make many mistakes. In addition, stemming algorithms - |
|---|
| 40 | at least the ones presented here - are applicable to the written, not the |
|---|
| 41 | spoken, form of the language.<p> |
|---|
| 42 | |
|---|
| 43 | For some of the world's languages, Chinese for example, the concept of stemming |
|---|
| 44 | is not applicable, but it is certainly meaningful for the many languages |
|---|
| 45 | of the Indo-European group. In these languages words tend to be constant at |
|---|
| 46 | the front, and to vary at the end: |
|---|
| 47 | <pre> |
|---|
| 48 | -ion |
|---|
| 49 | -ions |
|---|
| 50 | connect-ive |
|---|
| 51 | -ed |
|---|
| 52 | -ing |
|---|
| 53 | </pre> |
|---|
| 54 | The variable part is the `ending', or `suffix'. Taking these endings off is |
|---|
| 55 | called `suffix stripping' or `stemming', and the residual part is called the |
|---|
| 56 | stem.<p> |
|---|
| 57 | |
|---|
| 58 | <h2>Endings</h2> |
|---|
| 59 | |
|---|
| 60 | Another way of looking at endings and suffixes is to think of the suffix as |
|---|
| 61 | being made up of a number of endings. For example, the French word |
|---|
| 62 | <pre> |
|---|
| 63 | confirmatives |
|---|
| 64 | </pre> |
|---|
| 65 | can be thought of as `confirm' with a chain of endings, |
|---|
| 66 | <pre> |
|---|
| 67 | -atif (adjectival ending - morphological) |
|---|
| 68 | plus -e (feminine ending - grammatical) |
|---|
| 69 | plus -s (plural ending - grammatical) |
|---|
| 70 | </pre> |
|---|
| 71 | -atif can also be thought of as -ate plus -if. Note that the addition of |
|---|
| 72 | endings can cause respellings, so -e changes preceding `f' to `v'.<p> |
|---|
| 73 | |
|---|
| 74 | Endings fall into two classes, grammatical and morphological. The addition of |
|---|
| 75 | -s in English to make a plural is an example of a grammatical ending. The |
|---|
| 76 | word remains of the same type. There is usually only one dictionary entry for |
|---|
| 77 | a word with all its various grammatical endings. Morphological endings create |
|---|
| 78 | new types of word. In English -ise or -ize makes verbs from nouns (`demon', |
|---|
| 79 | `demonise'), -ly makes adverbs from adjectives (`foolish', `foolishly'), and |
|---|
| 80 | so on. Usually there are separate dictionary endings for these creations.<p> |
|---|
| 81 | |
|---|
| 82 | <h2>Language knowledge</h2> |
|---|
| 83 | |
|---|
| 84 | It is much easier to write a stemming algorithm for a language when you |
|---|
| 85 | are familiar with it. If you are not, you will probably need to work with |
|---|
| 86 | someone who is, and who can also explain details of grammar to you. Best |
|---|
| 87 | is a professional teacher or translator. You certainly don't need to have |
|---|
| 88 | a world authority on the grammar of the language. In fact too much |
|---|
| 89 | expertise can get in the way when it comes to the very practical matter of |
|---|
| 90 | writing the stemming algorithm.<p> |
|---|
| 91 | |
|---|
| 92 | <H2>Vocabularies</H2> |
|---|
| 93 | |
|---|
| 94 | Each stemmer is issued with a vocabulary in data/voc.txt, and its stemmed |
|---|
| 95 | form in data/voc.st. You can use these for testing and evaluation purposes. |
|---|
| 96 | |
|---|
| 97 | <h2>Raw materials</h2> |
|---|
| 98 | |
|---|
| 99 | A conventional grammar of a language will list all the grammatical endings, |
|---|
| 100 | and will often summarise most of the morphological endings. A grammar, plus a |
|---|
| 101 | dictionary, are therefore basic references in the development of a stemming |
|---|
| 102 | algorithm, although you can dispense with them if you have an excellent |
|---|
| 103 | knowledge of the language. What you cannot dispense with is a vocabulary to |
|---|
| 104 | try the algorithm out on as it is being developed. Assemble about 2 megabytes |
|---|
| 105 | of text. A mix of sources is best, and literary prose (conventional novels) |
|---|
| 106 | usually gives an ideal mix of tenses, cases, persons, genders etc. Obviously |
|---|
| 107 | the texts should be in some sense 'contemporary', but it is an error to |
|---|
| 108 | exclude anything slightly old. The algorithm itself may well get applied to |
|---|
| 109 | older texts once it has been written. For English, the works of Shakespeare |
|---|
| 110 | in the customary modern spelling make a good test vocabulary.<p> |
|---|
| 111 | |
|---|
| 112 | From the source text derive a control vocabulary of words in sorted order. |
|---|
| 113 | Sample vocabularies in this style are part of our Open Source release. If you |
|---|
| 114 | make a small change to the stemming algorithm you should have a procedure |
|---|
| 115 | that presents the change as a three column table: column one is the control |
|---|
| 116 | vocabulary, column 2 the stemmed equivalent, and column 3 the stemmed |
|---|
| 117 | equivalent after the change has been made to the algorithm. The effects of |
|---|
| 118 | the change can be evaluated by looking at the differences between columns two |
|---|
| 119 | and three.<p> |
|---|
| 120 | |
|---|
| 121 | The first job is to come up with a list of endings. This can be done by |
|---|
| 122 | referring to the grammar, the dictionary, and also by browsing through the |
|---|
| 123 | control vocabulary.<p> |
|---|
| 124 | |
|---|
| 125 | <h2>Rules for removing endings</h2> |
|---|
| 126 | |
|---|
| 127 | If a word has an ending, E, when should E be removed? Various criteria come |
|---|
| 128 | into play here. One is the knowledge we have about the word from other |
|---|
| 129 | endings that might have been removed. If a word ends with a grammatical verb |
|---|
| 130 | ending, and that has been removed, then we have a verb form, and the only |
|---|
| 131 | further endings to consider are morphological endings that create verbs from |
|---|
| 132 | other word types. At this level the system of endings gives rise to a small |
|---|
| 133 | state table, which can be followed in devising the algorithm. In Latin |
|---|
| 134 | derived languages, there is a state table of morphological endings that |
|---|
| 135 | roughly looks like this: |
|---|
| 136 | <pre> |
|---|
| 137 | -IC (adj) -+-> -ATION (noun) |
|---|
| 138 | +-> -ITY (noun) |
|---|
| 139 | +-> -MENT (adv) |
|---|
| 140 | \-> -AT (verb) -+-> -IV (adj) -+-> -ITY (noun) |
|---|
| 141 | | \-> -MENT (adv) |
|---|
| 142 | \-> -OR (noun) |
|---|
| 143 | |
|---|
| 144 | -ABLE (adj) -+-> -ITY (noun) |
|---|
| 145 | \-> -MENT (adv) |
|---|
| 146 | |
|---|
| 147 | -OUS (adj) ---> -MENT (adv) |
|---|
| 148 | </pre> |
|---|
| 149 | The ending forms take different values in different languages. In French, -OR |
|---|
| 150 | becomes `-eur' (m.) or `-rice' (f.), -AT disappears into the infinitive form |
|---|
| 151 | of a verb. In English, -MENT becomes `-ly', and then one can recognise, |
|---|
| 152 | <pre> |
|---|
| 153 | -IC-ATION fortification |
|---|
| 154 | -IC-ITY electricity |
|---|
| 155 | -IC-MENT fantastically |
|---|
| 156 | -AT-IV contemplative |
|---|
| 157 | -AT-OR conspirator |
|---|
| 158 | -IV-ITY relativity |
|---|
| 159 | -IV-MENT instinctively |
|---|
| 160 | -ABLE-ITY incapability |
|---|
| 161 | -ABLE-MENT charitably |
|---|
| 162 | -OUS-MENT famously |
|---|
| 163 | </pre> |
|---|
| 164 | Trios, -IC-AT-IV etc., also occur, but sequences of length four, |
|---|
| 165 | -IC-AT-IV-ITY and -IC-AT-IV-MENT, are absent (or occur very rarely).<p> |
|---|
| 166 | |
|---|
| 167 | Sometimes the validity of an ending depends on the immediately preceding |
|---|
| 168 | group of letters. In Italian, for example, certain pronouns and pronoun |
|---|
| 169 | groups attach to present participle and infinitive forms of verbs, for |
|---|
| 170 | example, |
|---|
| 171 | <blockquote> |
|---|
| 172 | scrivendole = scrivendo (writing) + le (to her)<br> |
|---|
| 173 | mandarglielo = mandare (to give) + glielo (it to him) |
|---|
| 174 | </blockquote> |
|---|
| 175 | If E is the ending, the possible forms are -andoE, -endoE, -arE, -erE, -irE, |
|---|
| 176 | so E is removed in the context -Xndo or Yr, where X is a or e, and Y is a or |
|---|
| 177 | e or i. See the <code>attached_pronoun</code> procedure in the Italian |
|---|
| 178 | stemmer.<p> |
|---|
| 179 | |
|---|
| 180 | The most useful criterion for removing an ending, however, is to base the |
|---|
| 181 | decision on the syllable length of the stem that will remain. This idea was |
|---|
| 182 | first used in the English stemming algorithm, and has been found to be |
|---|
| 183 | applicable in the other stemming algorithms too. If C stands for a sequence |
|---|
| 184 | of consonants, and V for a sequence of vowels, any word can be analysed as, |
|---|
| 185 | <pre> |
|---|
| 186 | [C] V C ... V C [V] |
|---|
| 187 | </pre> |
|---|
| 188 | where [..] indicates arbitrary presence, and V C ... V C means V C repeated |
|---|
| 189 | zero or more |
|---|
| 190 | times. We can find successive positions 0, 1, 2 ... in a word corresponding |
|---|
| 191 | to each vowel-consonant stretch V C, |
|---|
| 192 | <pre> |
|---|
| 193 | t h u n d e r s t r i c k e n |
|---|
| 194 | 0 1 2 3 4 |
|---|
| 195 | </pre> |
|---|
| 196 | The closer E is to the beginning of the word, the more unwilling we should be |
|---|
| 197 | remove it. So we might have a rule to remove E if at is after position 2, and |
|---|
| 198 | so on.<p> |
|---|
| 199 | |
|---|
| 200 | <h2>Developing the algorithm</h2> |
|---|
| 201 | |
|---|
| 202 | Build the algorithm up bit by bit, trying out a small number of ending |
|---|
| 203 | removals at a time. For each new ending plus rule added, decide whether, on |
|---|
| 204 | average, the stemming process is improved or degraded. If it is degraded the |
|---|
| 205 | rule is unhelpful and can be discarded.<p> |
|---|
| 206 | |
|---|
| 207 | This sounds like common sense, but it is actually very easy to fall into the |
|---|
| 208 | trap of endlessly elaborating the rules without looking at their true effect. |
|---|
| 209 | What you find eventually is that you can be improving performance in one area |
|---|
| 210 | of the vocabulary, while causing a similar degradation of performance in |
|---|
| 211 | another area. When this happens consistently it is time to call a halt to |
|---|
| 212 | development and to regard the stemming algorithm as finished.<p> |
|---|
| 213 | |
|---|
| 214 | It is important to realise that the stemming process cannot be made perfect. |
|---|
| 215 | For example, in French, the simple verb endings -ons and -ent of the present |
|---|
| 216 | tense occur repeatedly in other contexts. -ons is the plural form of all nouns |
|---|
| 217 | ending -on, and -ent is also a common noun ending. On balance it is best not |
|---|
| 218 | to remove these endings. In practice this affects -ent verb endings more than |
|---|
| 219 | -ons verb endings, since -ent endings are commoner. The result is that verbs |
|---|
| 220 | stem not to a single form, but to a much smaller number of forms (three), |
|---|
| 221 | among which the form given by the true stem of the verb is by far the |
|---|
| 222 | commonest.<p> |
|---|
| 223 | |
|---|
| 224 | If we define errors A and B by, |
|---|
| 225 | <blockquote> |
|---|
| 226 | error A: removing an ending when it is not an ending<br> |
|---|
| 227 | error B: not removing an ending when it is an ending |
|---|
| 228 | </blockquote> |
|---|
| 229 | Then removing -ent leads to error A; not removing -ent leads to error B. We |
|---|
| 230 | must adopt the rule that minimises the number of errors - not the rule that |
|---|
| 231 | appears to be the most elegant.<p> |
|---|
| 232 | |
|---|
| 233 | <h2>Irregular forms</h2> |
|---|
| 234 | |
|---|
| 235 | Linguistic irregularities slip through the net of a stemming algorithm. The |
|---|
| 236 | English stemmer stems `cows' to `cow', but does not stem `oxen' to `ox'. In |
|---|
| 237 | reality this matters much less than one might suppose. In English, the |
|---|
| 238 | irregular plurals tend to be of things that were common in Anglo-Saxon |
|---|
| 239 | England: oxen, sheep, mice, dice - and lice. Men, women and children are of |
|---|
| 240 | course common today, but the very commonness of these words makes them of |
|---|
| 241 | less importance in IR systems. Similar remarks may be said about irregular |
|---|
| 242 | verbs in English, the total number of which is around 150. Here, the fact |
|---|
| 243 | that verbs are used perhaps rather less than nouns and adjectives in IR |
|---|
| 244 | queries helps account for the unimportance of verb irregularities in IR |
|---|
| 245 | performance. There are in English more significant irregularities in |
|---|
| 246 | morphological changes such as `receive' to `reception', `decide' to |
|---|
| 247 | `decision' etc., which correspond, ultimately, to irregularities in the Latin |
|---|
| 248 | verbs from which these words derive. But again working IR systems are rarely |
|---|
| 249 | upset by lack of resolution of these forms.<p> |
|---|
| 250 | |
|---|
| 251 | An irregularity of English which does cause a problem is the word `news'. It |
|---|
| 252 | is now a singular noun, and is never regarded as the plural of `new'. This, |
|---|
| 253 | and a few more howlers, are placed in a table, <code>irregular_forms</code>, in the |
|---|
| 254 | English stemming algorithm. Similar tables are provided in the other stemming |
|---|
| 255 | algorithms, with some provisional entries. The non-English stemming |
|---|
| 256 | algorithms have not been used enough for a significant list of irregular |
|---|
| 257 | forms to emerge, but as they do, they can be placed in the <code>irregular_forms</code> |
|---|
| 258 | table.<p> |
|---|
| 259 | |
|---|
| 260 | <h2>Using stemming in IR</h2> |
|---|
| 261 | |
|---|
| 262 | In earlier implementations of IR systems, the words of a text were |
|---|
| 263 | usually stemmed as part of the indexing process, and the stemmed forms |
|---|
| 264 | only held in the main IR index. The words of each incoming query would |
|---|
| 265 | then be stemmed similarly. When the index terms were seen by the |
|---|
| 266 | user, for example during query expansion, they would be seen in their |
|---|
| 267 | stemmed form. It was important therefore that the stemmed form of a |
|---|
| 268 | word should not be too unfamiliar in appearance. A user will be |
|---|
| 269 | comfortable with seeing `apprehend', which stands for 'apprehending', |
|---|
| 270 | `apprehended' as well as `apprehend'. More problematical is |
|---|
| 271 | `apprehens', standing for `apprehension', `apprehensive' etc., but |
|---|
| 272 | even so, a trained user would not have a problem with this. In fact |
|---|
| 273 | all the Xapian stemming algorithms are built on the assumption that it |
|---|
| 274 | leave stemmed forms which it would not be embarrassing to show to real |
|---|
| 275 | users, and we suggest that new stemming algorithms are designed with |
|---|
| 276 | this criterion in mind.<p> |
|---|
| 277 | |
|---|
| 278 | A superior approach is to keep each word, <i>W</i>, and its stemmed form, |
|---|
| 279 | <i>s(W)</i>, as a two-way relation in the IR system. <i>W</i> is held in |
|---|
| 280 | the index with its own posting list. <i>s(W)</i> could have its separate |
|---|
| 281 | posting list, but this would be derivable from the class of words that |
|---|
| 282 | stem to <i>s(W)</i>. The important thing is to have the <i>W</i> <-> |
|---|
| 283 | <i>s(W)</i> relation. From <i>W</i> we can derive <i>s(W)</i>, the stemmed |
|---|
| 284 | form. From a stemmed form <i>s(W)</i> we can derive <i>W</i> plus the |
|---|
| 285 | other words in the IR system which stem to <i>s(W)</i>. Any word can then |
|---|
| 286 | be searched on either stemmed or unstemmed. If the stemmed form of a word |
|---|
| 287 | needs to be shown to the user, it can be represented by the commonest |
|---|
| 288 | among the words which stem to that form.<p> |
|---|
| 289 | |
|---|
| 290 | <h2>Stopwords</h2> |
|---|
| 291 | |
|---|
| 292 | It has been traditional in setting up IR systems to discard the very |
|---|
| 293 | commonest words of a language - the stopwords - during indexing. |
|---|
| 294 | A more modern approach is to index everything, which greatly assists |
|---|
| 295 | searching for phrases for example. Stopwords can then still be eliminated from the |
|---|
| 296 | query as an optional style of retrieval. In either case, a list of |
|---|
| 297 | stopwords for a language is useful.<p> |
|---|
| 298 | |
|---|
| 299 | Getting a |
|---|
| 300 | list of stopwords can be done by sorting a vocabulary of a text corpus for |
|---|
| 301 | a language by frequency, and going down the list picking off words to be |
|---|
| 302 | discarded.<p> |
|---|
| 303 | |
|---|
| 304 | The stopword list connects in various |
|---|
| 305 | ways with the stemming algorithm:<p> |
|---|
| 306 | |
|---|
| 307 | The stemming algorithm can itself be used to detect and remove stopwords. One |
|---|
| 308 | would add into the <code>irregular_forms</code> table something like this, |
|---|
| 309 | <pre> |
|---|
| 310 | "", /* null string */ |
|---|
| 311 | |
|---|
| 312 | "am/is/are/be/being/been/" /* BE */ |
|---|
| 313 | "have/has/having/had/" /* HAD */ |
|---|
| 314 | "do/does/doing/did/" /* DID */ |
|---|
| 315 | ... /* multi-line string */ |
|---|
| 316 | </pre> |
|---|
| 317 | so that the words `am', `is' etc. map to the null string (or some other |
|---|
| 318 | easily recognised value).<p> |
|---|
| 319 | |
|---|
| 320 | Alternatively, stopwords could be removed before the stemming algorithm is |
|---|
| 321 | applied, or after the stemming algorithm is applied. In this latter case, the |
|---|
| 322 | words to be removed must themselves have gone through the stemmer, and the |
|---|
| 323 | number of distinct forms will be greatly reduced as a result. In Italian for |
|---|
| 324 | example, the four forms |
|---|
| 325 | <pre> |
|---|
| 326 | questa queste questi questo |
|---|
| 327 | </pre> |
|---|
| 328 | (meaning `that') all stem to |
|---|
| 329 | <pre> |
|---|
| 330 | quest |
|---|
| 331 | </pre> |
|---|
| 332 | |
|---|
| 333 | In the xapian-data directory in the SVN repository, each language represented |
|---|
| 334 | in the stemming section has, in addition to a large test vocabulary, a useful |
|---|
| 335 | stopword list in both source and stemmed form. The source form, in |
|---|
| 336 | the file <code>stopsource</code>, is carefully annotated, and the derived |
|---|
| 337 | file, <code>stopwords</code>, contains an equivalent list of sorted, stemmed, |
|---|
| 338 | stopwords.<p> |
|---|
| 339 | <!-- FOOTER $Author$ $Date$ $Id$ --> |
|---|
| 340 | </BODY> |
|---|
| 341 | </HTML> |
|---|