wiki:FlintSpellingTable

The Flint Spelling Table

This page describes the format of the Spelling table in the FlintBackend. This table is used to implement the spelling correction feature.

Key Format

Keys take one of the following forms:

  • Bookends:
'B' + first_char + last_char
  • Head bigrams:
'H' + <first two characters>
  • Mid trigrams:
'M' + trigram
  • Tail bigrams:
'T' + <last two characters>
  • Word frequencies:
'W' + word

Tag Format

For word frequency entries, the tag format is:

pack_uint_last(freq)

For the first four, the tag contains a sorted list of all the words which match. The first word is encoded as:

char(word.size() ^ MAGIC_XOR_VALUE) +
word

The second and subsequent words as (where reuse is the maximal common prefix length):

char(reuse ^ MAGIC_XOR_VALUE) +
char((word.size() - reuse) ^ MAGIC_XOR_VALUE) +
word.substr(reuse)

MAGIC_XOR_VALUE is 96. We XOR the length values with this so that they are more likely to coincide with lower-case ASCII letters, which are likely to be common in the words. This means that zlib should do a better job of compressing these tag values, and in tests this gave 5% better compression.

The tag may be stored compressed by zlib, which gives a decent amount of extra compression. Very small tags (less than the minimum size of compressed output) are stored uncompressed.

Last modified 16 years ago Last modified on 21/08/08 12:42:04
Note: See TracWiki for help on using the wiki.