wiki:FlintTermListTable

The Flint Term List Table

This page describes the format of the Term List table in the FlintBackend.

Key Format

pack_uint_preserving_sort(docid)

This takes one extra byte compared to Quartz, but means that when appending documents to a database, the insert is always in the same place (at the "end" of the table). This is faster, and produces a more compact database without a separate compaction step.

The plan for Chert or later is to subclass the compare routine so we can store the key as compactly as Quartz did but keep the improved sort order.

Tag Format

If the document isn't indexed by any terms, then the tag is empty.

Otherwise, the tag starts:

pack_uint(doclen) +
pack_uint(termlist_size)

Optionally followed by a '0' character (only required if the first term is 48 characters long!)

Followed by:

char(term.size()) +
term +
pack_uint(wdf)

Followed by zero or more entries, each of which can take one of two forms (the first is used when packed < 256):

char(packed) +
char(term.size() - reuse) +
term.substr(reuse)
char(reuse) +
char(term.size() - reuse) +
term.substr(reuse) +
pack_uint(wdf)

In the above, reuse is the length of the longest common prefix between this term and the previous one, and packed is:

(wdf + 1) * (prev_term.size() + 1) + reuse

Care is needed to avoid the above overflowing - in reality we first check that wdf < 127 as packed will never fit in a byte if it isn't.

To unpack, we know that reuse <= prev_term.size() if this condition doesn't hold, we know we must have the first form and unpack it thus:

reuse = value % (prev_term.size() + 1);
wdf = (value / (prev_term.size() + 1)) - 1;

The tag may be stored compressed by zlib, which gives a decent amount of extra compression. Very small tags (less than the minimum size of compressed output) are stored uncompressed.


For chert or later, we could use a similar packing to the above, but combine reuse and term.size() - reuse into a two byte value consisting of either:

reuse + (prev_term.size() + 1) * (term.size() - reuse)
reuse + (prev_term.size() + 1) * (term.size() - reuse) + (prev_term.size() + 1) * (256 - prev_term.size()) * wdf

This will allow the wdf byte to be omitted more often.

The second form will be used whenever the wdf is small enough to fit, so when it won't fit we can subtract one more than the largest value which would fit from the wdf and store that, which will allow pack_uint() to save a byte sometimes if the wdf values ever exceed 127!

Last modified 16 years ago Last modified on 08/21/08 12:18:18
Note: See TracWiki for help on using the wiki.