The Flint Term List Table
This page describes the format of the Term List table in the FlintBackend.
Key Format
pack_uint_preserving_sort(docid)
This takes one extra byte compared to Quartz, but means that when appending documents to a database, the insert is always in the same place (at the "end" of the table). This is faster, and produces a more compact database without a separate compaction step.
The plan for Chert or later is to subclass the compare routine so we can store the key as compactly as Quartz did but keep the improved sort order.
Tag Format
If the document isn't indexed by any terms, then the tag is empty.
Otherwise, the tag starts:
pack_uint(doclen) + pack_uint(termlist_size)
Optionally followed by a '0'
character (only required if the first term is
48 characters long!)
Followed by:
char(term.size()) + term + pack_uint(wdf)
Followed by zero or more entries, each of which can take one of two forms
(the first is used when packed < 256
):
char(packed) + char(term.size() - reuse) + term.substr(reuse)
char(reuse) + char(term.size() - reuse) + term.substr(reuse) + pack_uint(wdf)
In the above, reuse is the length of the longest common prefix between this term and the previous one, and packed is:
(wdf + 1) * (prev_term.size() + 1) + reuse
Care is needed to avoid the above overflowing - in reality we first check
that wdf < 127
as packed will never fit in a byte if it isn't.
To unpack, we know that reuse <= prev_term.size()
if this condition doesn't hold, we know we must have the first form and unpack
it thus:
reuse = value % (prev_term.size() + 1); wdf = (value / (prev_term.size() + 1)) - 1;
The tag may be stored compressed by zlib, which gives a decent amount of extra compression. Very small tags (less than the minimum size of compressed output) are stored uncompressed.
For chert or later, we could use a similar packing to the above, but combine
reuse
and term.size() - reuse
into a two byte value consisting of
either:
reuse + (prev_term.size() + 1) * (term.size() - reuse)
reuse + (prev_term.size() + 1) * (term.size() - reuse) + (prev_term.size() + 1) * (256 - prev_term.size()) * wdf
This will allow the wdf byte to be omitted more often.
The second form will be used whenever the wdf is small enough to fit, so when it won't fit we can subtract one more than the largest value which would fit from the wdf and store that, which will allow pack_uint() to save a byte sometimes if the wdf values ever exceed 127!