wiki:GSoC2011/SpellingCorrectionImprovements/ProjectPlan

May 23 — June 5:
Metric and Trigram method optimisations.
Several possible improvements would double the speed of the correction suggestion search without loss of quality or increased memory use. [see my article "Fuzzy string search"]
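For orientation, the basic idea behind a trigram candidate search can be sketched as follows. This is a minimal illustration, not the existing implementation: the function names, the "$" padding character and the min_shared threshold are all assumptions made for the example.

```python
from collections import defaultdict

def trigrams(word):
    """Return the set of character trigrams of a word padded with '$' markers."""
    padded = "$$" + word + "$$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_trigram_index(words):
    """Map every trigram to the set of dictionary words that contain it."""
    index = defaultdict(set)
    for w in words:
        for g in trigrams(w):
            index[g].add(w)
    return index

def trigram_candidates(index, query, min_shared=2):
    """Return words sharing at least min_shared trigrams with the query."""
    counts = defaultdict(int)
    for g in trigrams(query):
        for w in index.get(g, ()):
            counts[w] += 1
    return {w for w, c in counts.items() if c >= min_shared}
```

The candidates returned this way would still need to be checked with the edit distance metric before being offered as suggestions.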

FastSS algorithm implementation.
The FastSS algorithm is ten times faster than the Trigram method in practice. It also has parameters for tuning the trade-off between time and memory, so it should be implemented alongside the Trigram method to let the user choose between them. [see article at FastSS site]
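The core of FastSS is a deletion-neighbourhood index: every dictionary word is stored under all strings obtained by deleting up to k of its characters, and a query is looked up the same way. The sketch below shows only that idea; the function names and the default k=1 are assumptions, and the time/memory tuning variants are left out.

```python
from collections import defaultdict
from itertools import combinations

def deletion_neighbourhood(word, k):
    """All strings obtained by deleting up to k characters from word."""
    variants = {word}
    for n in range(1, min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), n):
            skip = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word) if i not in skip))
    return variants

def build_fastss_index(words, k=1):
    """Map every deletion variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for w in words:
        for v in deletion_neighbourhood(w, k):
            index[v].add(w)
    return index

def fastss_candidates(index, query, k=1):
    """Words that share at least one deletion variant with the query."""
    found = set()
    for v in deletion_neighbourhood(query, k):
        found |= index.get(v, set())
    return found
```

Sharing a variant only guarantees an edit distance of at most 2k, so the candidates are a superset of the true matches and should be verified with the real metric.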

June 6 — June 19:
Additional metric over default Damerau-Levenshtein distance for a more accurate ranking.
This metric can use the distance between keys on the keyboard, or the position of a mistake in a word (a mistake at the beginning has more weight), to provide a more accurate ranking.
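One way to fold keyboard distance into the metric is to make substitutions between neighbouring keys cheaper. The sketch below does this for a restricted Damerau-Levenshtein distance (optimal string alignment). The QWERTY grid, the neighbour test and the 0.5 cost are assumptions chosen for illustration; position weighting is not shown.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}

def substitution_cost(a, b):
    """0 for equal letters, 0.5 for neighbouring keys (assumed weight), else 1."""
    if a == b:
        return 0.0
    pa, pb = KEY_POS.get(a), KEY_POS.get(b)
    if pa is not None and pb is not None:
        if abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1:
            return 0.5
    return 1.0

def weighted_distance(s, t):
    """Optimal string alignment distance with keyboard-weighted substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)
    for j in range(cols):
        d[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                       # deletion
                d[i][j - 1] + 1.0,                                       # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]), # substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[rows - 1][cols - 1]
```

With these weights "cst" ranks closer to "cat" than "cpt" does, because "s" sits next to "a" on the keyboard.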

Context-sensitive spelling correction.
Word suggestion rank is computed using metric distance and frequencies of words, word pairs and word triples in a query.
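A context-sensitive rank could combine the metric distance with word and word-pair frequencies roughly as below. The scoring formula, the distance_penalty value and the frequency tables are hypothetical; word triples are omitted to keep the example short.

```python
import math

def suggestion_score(candidate, distance, prev_word, unigram_freq, bigram_freq,
                     distance_penalty=2.0):
    """Higher is better: frequent words and frequent pairs win, distance costs."""
    word_term = math.log1p(unigram_freq.get(candidate, 0))
    pair_term = math.log1p(bigram_freq.get((prev_word, candidate), 0))
    return word_term + pair_term - distance_penalty * distance

def rank_suggestions(candidates, prev_word, unigram_freq, bigram_freq):
    """candidates maps each suggested word to its metric distance from the typo."""
    return sorted(
        candidates,
        key=lambda c: suggestion_score(c, candidates[c], prev_word,
                                       unigram_freq, bigram_freq),
        reverse=True,
    )
```

For example, after the word "world" the candidate "peace" would outrank "piece" if the pair ("world", "peace") is frequent, even when both are equally distant from the typed word.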

Handle words that have been run together and words that have been split into parts.
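A minimal sketch of both cases, assuming the dictionary is available as a plain set of words (the function names are illustrative only):

```python
def split_candidates(word, dictionary):
    """Ways to split one token into two dictionary words."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in dictionary and word[i:] in dictionary
    ]

def merge_candidates(words, dictionary):
    """Adjacent token pairs whose concatenation is a dictionary word.

    Returns (index of the first token, merged word) pairs.
    """
    return [
        (i, words[i] + words[i + 1])
        for i in range(len(words) - 1)
        if words[i] + words[i + 1] in dictionary
    ]
```

Both functions only propose candidates; whether a split or merge is actually better than the original query would be decided by the ranking step.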

June 20 — June 26:
Separate spelling correction settings for different fields.

"Fuzzy" query operator.

June 27 — July 3:
Phonetic algorithms for selected fields.
Implement some phonetic algorithms (Daitch-Mokotoff Soundex, Metaphone, and so on) to provide easier search for names and surnames.
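To show the general shape of a phonetic key, here is a sketch of the classic American Soundex. It is a simpler relative of the algorithms named above, not Daitch-Mokotoff Soundex or Metaphone, and is used here only because it fits in a few lines.

```python
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(name):
    """Classic four-character Soundex key, e.g. 'Robert' -> 'R163'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]
```

Names indexed under the same key (for instance "Robert" and "Rupert") would then match each other in fields where phonetic search is enabled.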

July 4 — July 10:
Romanisation of selected fields for cross-language search of names or surnames.
This feature will allow users to search for names and surnames in different languages using romanisation.
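As an illustration, romanisation of a Cyrillic field could be a simple character mapping applied at index and query time. The table below is an assumed, simplified mapping for Russian and does not follow any particular transliteration standard.

```python
# Illustrative Russian-to-Latin table (assumption, not a specific standard).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "iu", "я": "ia",
}

def romanise(text):
    """Lower-case the text and replace Cyrillic letters; other characters pass through."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())
```

Indexing the romanised form next to the original would let a query such as "ivanov" find documents containing "Иванов".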

Keyboard layout mismatch error correction.
A user with multiple keyboard layouts may type a query using the wrong layout; the query should still return the same results.
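A sketch of the idea for one layout pair, English QWERTY and Russian ЙЦУКЕН, is shown below. The choice of layouts and the function name are assumptions; only lower-case keys are mapped.

```python
# Keys in the same physical positions on the two layouts (lower case only).
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)

def layout_variants(query):
    """The query as typed plus its re-interpretation in the other layout."""
    return {query, query.translate(EN_TO_RU), query.translate(RU_TO_EN)}
```

Each variant can then be checked against the dictionary, so that "ghbdtn" typed with the wrong layout is recognised as "привет".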

July 11 — July 17:
List of ranked possible corrections for correction suggestion.
It allows the user to select the most appropriate correction variant.

Query result set is built using multiple correction suggestions.
The result set should contain the search results for the first few correction suggestions.
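One possible way to merge the results is sketched below. The search callable, the 1/(rank+1) down-weighting of lower-ranked suggestions and the max_suggestions default are all assumptions for the example, and scores are assumed to be positive.

```python
def merged_results(suggestions, search, max_suggestions=3):
    """Merge results for the top-ranked corrected queries.

    suggestions: corrected queries, best first.
    search: callable returning a list of (doc_id, score) pairs for a query.
    """
    best = {}
    for rank, query in enumerate(suggestions[:max_suggestions]):
        weight = 1.0 / (rank + 1)  # later suggestions count for less
        for doc_id, score in search(query):
            weighted = score * weight
            if weighted > best.get(doc_id, 0.0):
                best[doc_id] = weighted
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

A document matched by several suggestions keeps only its best weighted score, so it appears once in the merged list.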

July 18 — July 31:
Dictionary for stemming.
The dictionary may contain root words or stemming rules for certain words.
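A minimal sketch of the root-word case: the dictionary is consulted first and an algorithmic stemmer is used only as a fallback. Both the exceptions mapping and the fallback callable are hypothetical stand-ins.

```python
def make_dictionary_stemmer(exceptions, fallback):
    """Build a stemmer that prefers dictionary entries over the fallback stemmer.

    exceptions: dict mapping a lower-case word to its root word.
    fallback: callable implementing the ordinary stemming algorithm.
    """
    def stem(word):
        lower = word.lower()
        if lower in exceptions:
            return exceptions[lower]
        return fallback(lower)
    return stem
```

This keeps irregular forms such as "mice" or "went" out of the stemming algorithm itself.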

Document language detection (automatic or manual) to provide different (stemming) algorithms for documents in different languages.

August 1 — August 7:
Remember which correction the user chose from the suggestion list.
This feedback helps adjust the spelling correction ranks so that future suggestions improve.
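The feedback loop could look roughly like this. The class name and the simple "most often accepted first" policy are assumptions; a real implementation would more likely blend the counts into the existing rank than override it.

```python
from collections import Counter

class CorrectionFeedback:
    """Counts how often users accepted a given correction for a misspelling."""

    def __init__(self):
        self.accepted = Counter()

    def record(self, misspelling, chosen):
        """Remember that the user picked `chosen` for `misspelling`."""
        self.accepted[(misspelling, chosen)] += 1

    def rerank(self, misspelling, ranked):
        """Move frequently accepted corrections forward; ties keep their order."""
        return sorted(ranked, key=lambda c: -self.accepted[(misspelling, c)])
```

Because the sort is stable, suggestions that were never chosen stay in the order produced by the original ranking.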

August 8 — August 15:
Final documentation, testing and so on.

August 16 — August 22:
Submission of final evaluations to Google by both students and mentors.

Last modified on 09/06/11 20:36:35