GSoC2011/SpellingCorrectionImprovements/Journal – Xapian

wiki:GSoC2011/SpellingCorrectionImprovements/Journal

Context Navigation

Community Bonding Week 1: April 25-May 1
Community Bonding Week 2: May 2-May 8
Community Bonding Week 3: May 9-May 15
Community Bonding Week 4: May 16-May 22
Coding Week 1: May 23-May 29
Coding Week 2: May 30-June 5
Coding Week 3: June 6-June 12
Coding Week 4: June 13-June 19
Coding Week 5: June 20-June 26
Coding Week 6: June 27-July 3
Coding Week 7: July 4-July 10
Coding Week 8: July 11-July 17 (Midterm)
Coding Week 9: July 18-July 24
Coding Week 10: July 25-July 31
Coding Week 11: August 1-August 7
Coding Week 12: August 8-August 14
Coding Week 13: August 15-August 22 (Final evaluation)

Community Bonding Week 1: April 25-May 1

Community Bonding Week 2: May 2-May 8

Community Bonding Week 3: May 9-May 15

Community Bonding Week 4: May 16-May 22

Coding Week 1: May 23-May 29

Refactorings

BrassSpellingTable.
Particular algorithm implementation (n-gram) was moved to derived class (BrassSpellingTableNGram):
Common spelling methods and data (word frequencies) were keeped in BrassSpellingTable.
Method toggle_word(...), parts of merge_changes(...) and open_termlist(...) methods (related to a spelling algorithm) were made pure virtual merge_fragment_changes(...) and populate_word(...) respectively to be implemented in derived classes.

backends/brass/brass_spelling.cc
backends/brass/brass_spelling.h

BrassSpellingNGram - old n-gram spelling algorithm:
backends/brass/brass_spelling_ngram.cc
backends/brass/brass_spelling_ngram.h

N-Gram optimized algorithm

The main idea of this optimization is to process only those words, which contain the specified n-gram at the same position +- k (max error count) in the given word.
The implementation stores n-grams separated by position in the adding word (position is stored after a prefix).
N-Grams that are represent beginning or ending of a word have the placeholder '$' at the first or last position in n-gram.

backends/brass/brass_spelling_new.cc
backends/brass/brass_spelling_new.h

Test shows that the Database::get_spelling_suggestion(...) method now works 80% faster.

For the all methods:

One thing we should pay attention to is the uniqueness of prefixes - no one should be the same as or starts with another. In addition, new methods work at "unicode character" level (not the byte level as it was) which is more logically correct.

Coding Week 2: May 30-June 5

Implemented fastss algorithm. Performance test shows x8 spelling correction search time reduction with 2 errors.

FastSS algorithm

The main idea of this algorithm is to reduce the fuzzy matching to the exact string matching by storing and searching words with skips - char deletions at one or more positions in the word.

For example:

Insertion / Deletion - Match "three" and "tree" One skip in "three" - "t_ree". Exact "tree" - "tree".

Substitution - Match "test" and "tent":
One skip in "test" - "te_t". One skip in "tent" - "te_t".

Transposition - Match "trail" and "trial":
One skip in "trail" - "tra_l". One skip in "trial" - "tr_al".

The implementation assigns numeric indexes to words, and then use these indexes in spelling data: word index and the mask of character skips are packed into one 32-bit integer value and stored sorted (by comparing of two word taking char skips into account) in lists. These lists are stored separately by keys - prefixes of N-character length.

There are the INDEXMAX and the INDEXSTACK values. INDEXMAX keeps value for the next new word. It is increased by 1 each time a new word was added if the INDEXSTACK is empty. INDEXSTACK keeps free indexes which were released after the corresponding word was removed. Last value from this stack is used as index for a newly added word.

There is the LIMIT constant which limits the only first LIMIT chars to be skipped in entry construction. The constant MAX_DISTANCE defines max available error count, which would be used in spelling data construction.

Average space usage - 4 * word length ^{max available errors} / 2 bytes per word.
Average time complexity - O(N log N)

The most appropriate prefix-key length (PREFIX_LENGTH constant) - 4 characters. It allows about 8x faster search than the usual ngram approach.

It covers a bit more errors and doesn't pass some tests because it works with errors in short words too.

It is possible to replace word indexes with word hashes, which allows us to remove INDEXMAX and INDEXSTACK structures and decrease complexity of merge operation. But it will make possible hash collisions and increase the entry size.

backends/brass/brass_spelling_fastss.cc
backends/brass/brass_spelling_fastss.h

Performance test

Performance test is aimed to measure the search time of single-word and sequence-word spelling suggestions - Database::get_spelling_suggestion(...) method.
It uses the randomly shuffled dictionary to fill spelling table (about 25 word pairs per word), 50% words are skipped to allow query to contain unknown words.
Search sequences are built using the same dictionary, but it will contain unknown words because not all words were added to the spelling data.

tests/testdata/dict.txt
tests/perftest/perftest_spelling.h
tests/perftest/perftest_spelling.cc

Coding Week 3: June 6-June 12

Extended edit distance

Implemented extended edit distance (Extended Edit Distance class), which allows precise distance calculation between two words.

Edit distance computation was moved to separated class ExtendedEditDistance.
Implemented using dynamic programming algorithm with Ukkonen cutoff.

Resulting distance is computed using insertion/deletion/substitution/transposition costs, which is calculated, respectively, by the following methods:
get_insert_cost(index, length, character)
get_delete_cost(index, length, character)
get_replace_cost(index, length, first character, second character)
get_transposition_cost(index, length, first character, second character)

Each cost calculation method uses the passed arguments (index of character in the word, length of the word, character itself) to make result more precise and related to the intuitive meaning.
Each cost can be about from 0.75 to 1.25.
The farther from the beginning the character in a word, the higher its cost.

spelling/extended_edit_distance.cc
spelling/extended_edit_distance.h

Frequencies of word pairs

Frequencies of word pairs (bigrams) are stored in addition to single-word (unigram) frequencies to allow spelling correction of word sequences.

Word pair frequencies are stored in the BrassSpellingTable as pair of two word hashes (as a key, 8 bytes) and frequency itself (as a value). Word hashes in the pair are sorted by the lexicographic order of the words, so the word pair is unordered.
There is a small chance of the hash collision but I don't think that the probability is significant, but the space usage reduction is noticeable.

Methods which were added to the BrassSpellingTable class:
add_words(first word, second word);
remove_words(first word, second word);
get_words_frequency(first word, second word);

Methods which were added to the BrassDatabase class:
get_spellings_frequency(...)

Methods which were added to the BrassWritableDatabase class:
add_spellings(...) remove_spellings(...)

Methods which were added to the WritableDatabase class:
add_spelling(first word, second word);
remove_spelling(first word, second word);

backends/brass/brass_spelling.cc
backends/brass/brass_spelling.h

Coding Week 4: June 13-June 19

Base class: SpellingBase. It provides methods for single-word and sequence-word spelling corrections.
Also it contains protected methods request_internal_freq(...) and request_internal(...) which return the absolute and relative frequencies of words and word pairs respectively.

Derived

spelling/spelling_base.h spelling/spelling_base.cc

Word sequence spelling corrections.

Everything related to the spelling correction (from Database::get_spelling_suggestion) has been moved to the SpellingCorrector class.

It provides methods to obtain spelling corrections for a single word or a sequence of words using fuzzy match.
The possible results is sorted by total relative frequency and top N (1 for single suggestion, more for multiple suggestion) of them is returned.
Resulting frequency is computed by summing relative frequencies for overlapping word pairs in word sequence, for example (relative_freq is the request_internal(...) method):

result = relative_freq(the, quick) + relative_freq(quick, brown) + relative_freq(brown, fox) + ...

For each word in a sequence method finds some (defined by LIMIT_CORRECTIONS constant, by default = 5) spelling corrections, ordered by edit distance. Then, it tries to combine it to obtain maximum total relative frequency.

Gapping. Method may skip some (defined by MAX_GAP constant, by default = 1) words in a search query to provide best results.

spelling/spelling_corrector.h
spelling/spelling_corrector.cc

Word sequence splits and joins.

Word sequence splits and joins are provided by SpellingSplitter class.
This class tries to split or merge words in a sequence to find word sticks or breaks and correct them.

The number of simultaneous splits and merges are defined by MAX_SPLIT_COUNT (by default = 2) and MAX_MERGE_COUNT (by default = 2) respectively.
Also it tries to correct unknown parts of words by using SpellingCorrector with TOP_SPELLING_CORRECTIONS (by default = 3) correction variants count per word.

spelling/spelling_splitter.h
spelling/spelling_splitter.cc

Tests

Test 8 for word sequence spelling suggestions was added to the API spelling tests.

tests/api_spelling.cc

Coding Week 5: June 20-June 26

Fuzzy query operator.

There are two fuzzy query operators:

The short version: "The quick #brawn fox jumps."
The full version: "The quick (FUZZY/2 brown fox jumps) over the lazy dog." It makes phrase from the operator to the next operator, closing brase or end of string to be searched including its few spelling suggestions. Also its allows to specify the edit distance.

Separate spelling corrections for prefixed terms.

Separate spelling corrections allow to store spelling data for each prefix and for unprefixed terms separately or grouped with other prefixed spelling data or unprefixed spelling data.

Internally, it is implemented as spelling groups - integer number from 0 to 255. Group 0 means unprefixed terms. This number is integrated into each key (after prefix) that stored by BrassSpellingTable.

Spelling for the certain prefix can be enabled or disabled by the following API functions:
WritableDatabase::enable_spelling(prefix, prefix_group) - enable spelling for the given prefix and use prefix_group spelling data.
WritableDatabase::disable_spelling(prefix) - disable spelling for the given prefix.

Added methods to the BrassSpellingTable and BrassDatabase classes:
is_spelling_enabled(prefix); enable_spelling(prefix, group prefix); disable_spelling(prefix);

Modified files:
backends/brass/brass_database.h
backends/brass/brass_database.cc

backends/brass/brass_spelling.h
backends/brass/brass_spelling.cc

backends/brass/brass_spelling_new.h
backends/brass/brass_spelling_new.cc

backends/brass/brass_spelling_fastss.h
backends/brass/brass_spelling_fastss.cc

backends/brass/brass_spellingwordslist.cc

Also, every related method was modified to provide ability to specify prefix. Empty prefix means unprefixed term.

Prefixed spelling test

Test 9 for prefixed spelling suggestions was added to the API spelling tests.

tests/api_spelling.cc

Coding Week 6: June 27-July 3

Phonetic algorithms: Metaphone, Daitch-Mokotoff Soundex.

Wrapper class: SpellingPhonetic. It includes the SpellingTransliteration object to transliterate the given word to the required alphabet.
Base algorithm class: SpellingPhoneticImpl. It is base for a particular algorithm implementation.
spelling/spelling_phonetic.h
spelling/spelling_phonetic.cc

Metaphone: MetaphoneSpellingPhonetic.
spelling/spelling_phonetic_metaphone.h
spelling/spelling_phonetic_metaphone.cc

Daitch-Mokotoff Soundex: DMSoundexSpellingPhonetic.
spelling/spelling_phonetic_dmsoundex.h
spelling/spelling_phonetic_dmsoundex.cc

Coding Week 7: July 4-July 10

Transliteration (Russian)

Wrapper class SpellingTransliteration
Base class: SpellingTransliterationImpl
spelling/spelling_transliteration.h
spelling/spelling_transliteration.cc

EnglishSpellingTransliteration:
RussianSpellingTransliteration:
spelling/spelling_transliteration_alphabets.h

Transliteration is integrated into SpellingPhonetic class to provide word conversion to an appropriate alphabet.

Keyboard layouts (Russian, Arabic, French, Spain)

Wrapper class: SpellingKeyboard
Base class: SpellingKeyboardImpl
spelling/spelling_keyboard.h spelling/spelling_keyboard.cc

All keyboards:
spelling/spelling_keyboard_layouts.h

SpellingKeyboard contains keyboard keys position map which provides proximity between two given keys by calling get_key_proximity(...) method.

Keyboard Layouts and Transliteration are integrated into SpellingCorrector to provide its functionality for spelling suggestions.

Coding Week 8: July 11-July 17 (Midterm)

Multiple spelling suggestions

API method: Database::get_spelling_suggestions(...);

Multiple spelling suggestion returns multiple variants of spelling corrections for the given word or word sequence ordered by the total relative frequency.
Result count is passed as an argument to the API method.

Internally, it is implemented as combining all possible variants at each step and truncate it to the given count.

Multiple suggestions are provided by SpellingCorrector and SpellingSplitter get_multiple_spelling(...) methods.

spelling/spelling_corrector.h
spelling/spelling_corrector.cc
spelling/spelling_splitter_new.h
spelling/spelling_splitter_new.cc

Coding Week 9: July 18-July 24

Language autodetection. Provides n-gram based TextCat-like approach to identify languages.
It uses language models in TextCat format (but everything in unicode) which stored in languages/classification/ with .lm extension.
The list of available languages is present in languages/classification/languages file.

N-Gram classification method counts n-grams in given text that match n-grams for each language. N-Gram in language model are sorted by frequency, and this frequency is also used in result calculation.

languages/language_autodetect.h languages/classification/*.lm languages/classification/languages

Coding Week 10: July 25-July 31

"Unlikeness" ordering for multiple spelling suggestions.

Coding Week 11: August 1-August 7

Transliteration languages: Chinese, Japanese, Arabic, Hebrew

Coding Week 12: August 8-August 14

Documentation, Tests, User feedback

Coding Week 13: August 15-August 22 (Final evaluation)

Done

Last modified 14 years ago Last modified on 21/08/11 15:07:05

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text