The Flint Term List Table
This page describes the format of the Term List table in the FlintBackend.
This takes one extra byte compared to Quartz, but means that when appending documents to a database, the insert is always in the same place (at the "end" of the table). This is faster, and produces a more compact database without a separate compaction step.
The plan for Chert or later is to subclass the compare routine so we can store the key as compactly as Quartz did but keep the improved sort order.
If the document isn't indexed by any terms, then the tag is empty.
Otherwise, the tag starts:
pack_uint(doclen) + pack_uint(termlist_size)
Optionally followed by a
'0' character (only required if the first term is
48 characters long!)
char(term.size()) + term + pack_uint(wdf)
Followed by zero or more entries, each of which can take one of two forms
(the first is used when
packed < 256):
char(packed) + char(term.size() - reuse) + term.substr(reuse)
char(reuse) + char(term.size() - reuse) + term.substr(reuse) + pack_uint(wdf)
In the above, reuse is the length of the longest common prefix between this term and the previous one, and packed is:
(wdf + 1) * (prev_term.size() + 1) + reuse
Care is needed to avoid the above overflowing - in reality we first check
wdf < 127 as packed will never fit in a byte if it isn't.
To unpack, we know that
reuse <= prev_term.size()
if this condition doesn't hold, we know we must have the first form and unpack
reuse = value % (prev_term.size() + 1); wdf = (value / (prev_term.size() + 1)) - 1;
The tag may be stored compressed by zlib, which gives a decent amount of extra compression. Very small tags (less than the minimum size of compressed output) are stored uncompressed.
For chert or later, we could use a similar packing to the above, but combine
term.size() - reuse into a two byte value consisting of
reuse + (prev_term.size() + 1) * (term.size() - reuse)
reuse + (prev_term.size() + 1) * (term.size() - reuse) + (prev_term.size() + 1) * (256 - prev_term.size()) * wdf
This will allow the wdf byte to be omitted more often.
The second form will be used whenever the wdf is small enough to fit, so when it won't fit we can subtract one more than the largest value which would fit from the wdf and store that, which will allow pack_uint() to save a byte sometimes if the wdf values ever exceed 127!