root / tags / 1.0.8 / xapian-core / docs / spelling.rst

Revision 9239, 5.3 kB (checked in by olly, 16 months ago)

docs/spelling.rst: Fix typo.

Line 
1
2.. Copyright (C) 2007 Olly Betts
3
4==========================
5Xapian Spelling Correction
6==========================
7
8.. contents:: Table of contents
9
10Introduction
11============
12
13Xapian provides functionality which can suggest corrections for misspelled
14words in queries, or in other situations where it might be useful.  The
15suggestions are generated dynamically from the data that has been indexed, so
16the correction facility isn't tied to particular languages, and it can suggest
17proper nouns or specialist technical terms.
18
19Algorithm
20=========
21
22A list of candidate words is generated by matching trigrams (groups of 3
23adjacent characters) in the candidates against those in the misspelled
24word.  As well as groups of adjacent characters, "starts" and "ends"
25are generated with the first two and last two characters respectively
26(e.g. "FISH" generates: "<start>FI", "FIS", "ISH", and "SH<end>").
27
28This technique alone would missing many single-edit errors in two and three
29character words, so we handle these specially as follows:
30
31For a three character word (e.g. "ABC"), we generate trigrams for the two
32transposed forms too ("BAC" and "ACB"), in addition to "<start>AB", "ABC",
33and "BC<end>".
34
35For a two character word (e.g. "AB"), we generate the special start and end
36trigrams for the reversed form (i.e. "BA"), so the trigrams are "<start>AB",
37"AB<end>", "<start>BA", and "BA<end>".
38
39And for two, three, and four character words, we generate "bookend" bigrams
40consisting of the prefix 'B' followed by the first and last letters.  This
41allows us to handle transposition of the middle two characters of a four
42letter word, substitution or deletion of the middle character of a three
43letter word, or insertion in the middle of a two letter word.
44
45Note that we don't attempt to suggest corrections for single character words
46at all, since the suggestions are unlikely to be of good quality (we'd always
47suggest the same correction for a given database, probably "a" for English).
48We also don't currently attempt to suggest substitution corrections for two
49character words, though this would perhaps be useful in some cases.
50
51Those candidates with the better trigram matches are compared to the misspelled
52word by calculating the "edit distance" - that's the smallest number of
53operations required to turn one word into another.  The allowed operations
54are: insert a character; delete a character; change a character to another;
55transpose two adjacent characters.  The candidate with the smallest edit
56distance is found, and if more than one word has the smallest edit distance,
57that which occurs the most times is chosen.  If there's a tie of this too,
58it's essentially arbitrary which is chosen.
59
60The maximum edit distance to consider can be specified as an optional parameter
61to Xapian::Database::get_spelling_suggestion().  If not specified, the default
62is 2, which generally does a good job.  3 is also a reasonable choice in many
63cases.  For most uses, 1 is probably too low, and 4 or more probably too high.
64
65Unicode Support
66---------------
67
68Trigrams are generated at the byte level, but the edit distance calculation
69currently works with Unicode characters, so get_spelling_suggestion() should
70suggest suitable spelling corrections respecting the specified (or default)
71edit distance threshold.
72
73QueryParser Integration
74=======================
75
76If FLAG_SPELLING_CORRECTION is passed to QueryParser::parse_query() and
77QueryParser::set_database() has been called, the QueryParser will look for
78corrections for words in the query which aren't found in the database.
79
80If a correction is found, then a modified version of the query string will be
81generated which can be obtained by calling
82QueryParser::get_corrected_query_string().  However, the original query string
83will still be parsed, since you'll often want to ask the user "Did you mean:
84[...] ?" - if you want to automatically use the corrected form, just call
85QueryParser::parse_query() on it.
86
87Current Limitations
88===================
89
90Exactness
91---------
92
93Because Xapian only tests the edit distance for terms which match
94well (or at all!) on trigrams, it may not always suggest the same answer that
95would be found if all possible words were checked using the edit distance
96algorithm.  However, the best answer will usually be found, and an exhaustive
97search would be prohibitively expensive for many uses.
98
99Backend Support
100---------------
101
102Currently spelling correction is only supported for flint databases.  It
103works with a single database or multiple databases (use
104Database::add_database() as usual).  We've no plans to support it for the
105deprecated Quartz backend, nor for InMemory, but we do intend to support it for
106the remote backend in the future.
107
108Prefixed Terms
109--------------
110
111Currently spelling correction ignores prefixed terms.
112
113Omega
114-----
115
116Spelling correction hasn't yet been integrated into Omega.
117
118QueryParser changed word locations
119----------------------------------
120
121The QueryParser doesn't currently report the locations of changed words in
122the query string, so it's a bit fiddly to mark up the altered words specially
123in HTML output, for example.
124
125References
126==========
127
128The algorithm used to calculate the edit distance is based on that described in
129the paper "An extension of Ukkonen's enhanced dynamic programming ASM
130algorithm" by Hal Berghel, University of Arkansas, and David Roach, Acxiom
131Corporation.  It's available online at:
132http://berghel.net/publications/asm/asm.php
Note: See TracBrowser for help on using the browser.