Context Navigation

The Flint Spelling Table

This page describes the format of the Spelling table in the FlintBackend. This table is used to implement the spelling correction feature.

Key Format

Keys take one of the following forms:

Bookends:

'B' + first_char + last_char

Head bigrams:

'H' + <first two characters>

Mid trigrams:

'M' + trigram

Tail bigrams:

'T' + <last two characters>

Word frequencies:

'W' + word

Tag Format

For word frequency entries, the tag format is:

pack_uint_last(freq)

For the first four, the tag contains a sorted list of all the words which match. The first word is encoded as:

char(word.size() ^ MAGIC_XOR_VALUE) +
word

The second and subsequent words as (where reuse is the maximal common prefix length):

char(reuse ^ MAGIC_XOR_VALUE) +
char((word.size() - reuse) ^ MAGIC_XOR_VALUE) +
word.substr(reuse)

MAGIC_XOR_VALUE is 96. We XOR the length values with this so that they are more likely to coincide with lower-case ASCII letters, which are likely to be common in the words. This means that zlib should do a better job of compressing these tag values, and in tests this gave 5% better compression.

The tag may be stored compressed by zlib, which gives a decent amount of extra compression. Very small tags (less than the minimum size of compressed output) are stored uncompressed.

Last modified 17 years ago Last modified on 21/08/08 12:42:04

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text