| 1 | |
|---|
| 2 | .. Copyright (C) 2007 Olly Betts |
|---|
| 3 | |
|---|
| 4 | ========================== |
|---|
| 5 | Xapian Spelling Correction |
|---|
| 6 | ========================== |
|---|
| 7 | |
|---|
| 8 | .. contents:: Table of contents |
|---|
| 9 | |
|---|
| 10 | Introduction |
|---|
| 11 | ============ |
|---|
| 12 | |
|---|
| 13 | Xapian provides functionality which can suggest corrections for misspelled |
|---|
| 14 | words in queries, or in other situations where it might be useful. The |
|---|
| 15 | suggestions are generated dynamically from the data that has been indexed, so |
|---|
| 16 | the correction facility isn't tied to particular languages, and it can suggest |
|---|
| 17 | proper nouns or specialist technical terms. |
|---|
| 18 | |
|---|
| 19 | Algorithm |
|---|
| 20 | ========= |
|---|
| 21 | |
|---|
| 22 | A list of candidate words is generated by matching trigrams (groups of 3 |
|---|
| 23 | adjacent characters) in the candidates against those in the misspelled |
|---|
| 24 | word. As well as groups of adjacent characters, "starts" and "ends" |
|---|
| 25 | are generated with the first two and last two characters respectively |
|---|
| 26 | (e.g. "FISH" generates: "<start>FI", "FIS", "ISH", and "SH<end>"). |
|---|
| 27 | |
|---|
| 28 | This technique alone would missing many single-edit errors in two and three |
|---|
| 29 | character words, so we handle these specially as follows: |
|---|
| 30 | |
|---|
| 31 | For a three character word (e.g. "ABC"), we generate trigrams for the two |
|---|
| 32 | transposed forms too ("BAC" and "ACB"), in addition to "<start>AB", "ABC", |
|---|
| 33 | and "BC<end>". |
|---|
| 34 | |
|---|
| 35 | For a two character word (e.g. "AB"), we generate the special start and end |
|---|
| 36 | trigrams for the reversed form (i.e. "BA"), so the trigrams are "<start>AB", |
|---|
| 37 | "AB<end>", "<start>BA", and "BA<end>". |
|---|
| 38 | |
|---|
| 39 | And for two, three, and four character words, we generate "bookend" bigrams |
|---|
| 40 | consisting of the prefix 'B' followed by the first and last letters. This |
|---|
| 41 | allows us to handle transposition of the middle two characters of a four |
|---|
| 42 | letter word, substitution or deletion of the middle character of a three |
|---|
| 43 | letter word, or insertion in the middle of a two letter word. |
|---|
| 44 | |
|---|
| 45 | Note that we don't attempt to suggest corrections for single character words |
|---|
| 46 | at all, since the suggestions are unlikely to be of good quality (we'd always |
|---|
| 47 | suggest the same correction for a given database, probably "a" for English). |
|---|
| 48 | We also don't currently attempt to suggest substitution corrections for two |
|---|
| 49 | character words, though this would perhaps be useful in some cases. |
|---|
| 50 | |
|---|
| 51 | Those candidates with the better trigram matches are compared to the misspelled |
|---|
| 52 | word by calculating the "edit distance" - that's the smallest number of |
|---|
| 53 | operations required to turn one word into another. The allowed operations |
|---|
| 54 | are: insert a character; delete a character; change a character to another; |
|---|
| 55 | transpose two adjacent characters. The candidate with the smallest edit |
|---|
| 56 | distance is found, and if more than one word has the smallest edit distance, |
|---|
| 57 | that which occurs the most times is chosen. If there's a tie of this too, |
|---|
| 58 | it's essentially arbitrary which is chosen. |
|---|
| 59 | |
|---|
| 60 | The maximum edit distance to consider can be specified as an optional parameter |
|---|
| 61 | to Xapian::Database::get_spelling_suggestion(). If not specified, the default |
|---|
| 62 | is 2, which generally does a good job. 3 is also a reasonable choice in many |
|---|
| 63 | cases. For most uses, 1 is probably too low, and 4 or more probably too high. |
|---|
| 64 | |
|---|
| 65 | Unicode Support |
|---|
| 66 | --------------- |
|---|
| 67 | |
|---|
| 68 | Trigrams are generated at the byte level, but the edit distance calculation |
|---|
| 69 | currently works with Unicode characters, so get_spelling_suggestion() should |
|---|
| 70 | suggest suitable spelling corrections respecting the specified (or default) |
|---|
| 71 | edit distance threshold. |
|---|
| 72 | |
|---|
| 73 | QueryParser Integration |
|---|
| 74 | ======================= |
|---|
| 75 | |
|---|
| 76 | If FLAG_SPELLING_CORRECTION is passed to QueryParser::parse_query() and |
|---|
| 77 | QueryParser::set_database() has been called, the QueryParser will look for |
|---|
| 78 | corrections for words in the query which aren't found in the database. |
|---|
| 79 | |
|---|
| 80 | If a correction is found, then a modified version of the query string will be |
|---|
| 81 | generated which can be obtained by calling |
|---|
| 82 | QueryParser::get_corrected_query_string(). However, the original query string |
|---|
| 83 | will still be parsed, since you'll often want to ask the user "Did you mean: |
|---|
| 84 | [...] ?" - if you want to automatically use the corrected form, just call |
|---|
| 85 | QueryParser::parse_query() on it. |
|---|
| 86 | |
|---|
| 87 | Current Limitations |
|---|
| 88 | =================== |
|---|
| 89 | |
|---|
| 90 | Exactness |
|---|
| 91 | --------- |
|---|
| 92 | |
|---|
| 93 | Because Xapian only tests the edit distance for terms which match |
|---|
| 94 | well (or at all!) on trigrams, it may not always suggest the same answer that |
|---|
| 95 | would be found if all possible words were checked using the edit distance |
|---|
| 96 | algorithm. However, the best answer will usually be found, and an exhaustive |
|---|
| 97 | search would be prohibitively expensive for many uses. |
|---|
| 98 | |
|---|
| 99 | Backend Support |
|---|
| 100 | --------------- |
|---|
| 101 | |
|---|
| 102 | Currently spelling correction is only supported for flint databases. It |
|---|
| 103 | works with a single database or multiple databases (use |
|---|
| 104 | Database::add_database() as usual). We've no plans to support it for the |
|---|
| 105 | deprecated Quartz backend, nor for InMemory, but we do intend to support it for |
|---|
| 106 | the remote backend in the future. |
|---|
| 107 | |
|---|
| 108 | Prefixed Terms |
|---|
| 109 | -------------- |
|---|
| 110 | |
|---|
| 111 | Currently spelling correction ignores prefixed terms. |
|---|
| 112 | |
|---|
| 113 | Omega |
|---|
| 114 | ----- |
|---|
| 115 | |
|---|
| 116 | Spelling correction hasn't yet been integrated into Omega. |
|---|
| 117 | |
|---|
| 118 | QueryParser changed word locations |
|---|
| 119 | ---------------------------------- |
|---|
| 120 | |
|---|
| 121 | The QueryParser doesn't currently report the locations of changed words in |
|---|
| 122 | the query string, so it's a bit fiddly to mark up the altered words specially |
|---|
| 123 | in HTML output, for example. |
|---|
| 124 | |
|---|
| 125 | References |
|---|
| 126 | ========== |
|---|
| 127 | |
|---|
| 128 | The algorithm used to calculate the edit distance is based on that described in |
|---|
| 129 | the paper "An extension of Ukkonen's enhanced dynamic programming ASM |
|---|
| 130 | algorithm" by Hal Berghel, University of Arkansas, and David Roach, Acxiom |
|---|
| 131 | Corporation. It's available online at: |
|---|
| 132 | http://berghel.net/publications/asm/asm.php |
|---|