The Flint Spelling Table
This page describes the format of the Spelling table in the FlintBackend. This table is used to implement the spelling correction feature.
Key Format
Keys take one of the following forms:
- Bookends:
'B' + first_char + last_char
- Head bigrams:
'H' + <first two characters>
- Mid trigrams:
'M' + trigram
- Tail bigrams:
'T' + <last two characters>
- Word frequencies:
'W' + word
Tag Format
For word frequency entries, the tag format is:
pack_uint_last(freq)
For the first four, the tag contains a sorted list of all the words which match. The first word is encoded as:
char(word.size() ^ MAGIC_XOR_VALUE) + word
The second and subsequent words as (where reuse is the maximal common prefix length):
char(reuse ^ MAGIC_XOR_VALUE) + char((word.size() - reuse) ^ MAGIC_XOR_VALUE) + word.substr(reuse)
MAGIC_XOR_VALUE is 96. We XOR the length values with this so that they are more likely to coincide with lower-case ASCII letters, which are likely to be common in the words. This means that zlib should do a better job of compressing these tag values, and in tests this gave 5% better compression.
The tag may be stored compressed by zlib, which gives a decent amount of extra compression. Very small tags (less than the minimum size of compressed output) are stored uncompressed.