The Flint Post List Table
This page describes the format of the Post List table in the FlintBackend. This table is the index which converts a term to a list of document ids which it occurs in. It is at the core of all queries.
The list of postings for a term is stored in ascending order in one or more chunks (key/tag pairs). This allows the posting lists to implement a skip_to method which will avoid reading all of a long posting list when we don't need to.
The first chunk for each term is slightly different to the rest. Its key just contains the termname:
pack_string_preserving_sort(term)
And the tag has an additional header, after which is followed by tag format described below:
pack_uint(term_freq) + pack_uint(collection_freq) + pack_uint(first_docid)
The key of second and subsequent chunks also includes the first docid of the chunk, encoded in such a way that the chunks sort in order:
pack_string_preserving_sort(term) + pack_uint_preserving_sort(first_docid)
The tag starts with a header:
pack_bool(last_chunk_flag) + pack_uint(final_docid - first_docid) + pack_uint(wdf_of_first_docid) + pack_uint(doclen_of_first_docid)
This is followed by zero or more of:
pack_uint(increase_to_next_docid - 1) + pack_uint(wdf_of_next_docid) + pack_uint(doclen_of_next_docid)
The postlist table also contains a "metainfo" key:
'\0'
The tag for this has the format:
pack_uint(lastdocid) + pack_uint_last(total_length)
And user meta-data is also stored, with keys:
'\0' + '\xc0' + metainfo_key
The tag is just the metainfo tag (unless this is empty when no entry is stored).
In chert, the wdf is no longer stored along with every posting, instead it is stored in a separate postlist-like structure with keys which start like this:
'\0' + '\xe0'
Possible changes for chert and later:
The first tag (only!) starts off with some statistics:
termfreq
, collectionfreq
, lowestdocid
,
highestdocid
. We can store some of these using by coding using the "out of"
encoder from FlintPositionList
- e.g.
termfreq
<= (highestdocid
-lowestdocid
+ 1).
Similarly, collfreq
>= termfreq
and
highestdocid
>= lowestdocid
so actually store
(collfreq
- termfreq
) and
(highestdocid
- lowestdocid
) to make use of the full range from 0.
Because we store highestdocid in the first chunk, we don't need to store an islastchunk flag in each chunk like quartz does. That should help make the update code less fiddly.
Then all tags have:
wdf_1
, docid_2
- docid_1
- 1, wdf_2
, ..., wdf_n-1
, docid_n
- docid_n-1
- 1, wdf_n
(Since docid_1 is lowestdocid for the first chunk and comes from the key for second and subsequent chunks).
Unlike quartz and flint, we don't store the document length with every posting list entry (which is a lot of repeated information). Instead we store a posting list like structure containing all the document lengths.
Try to encode integers so that postlists can be decoded backwards (high bit at each end of the 7 bit encoding; no high bit on a one byte number)? Or just rely on being able to decode a chunk and then run through it backwards?
Note that 7 bit encoding is wasteful - there are multiple encoding for numbers. But encoding without these is more involved and doesn't actually save much space at all in testing on real databases. Perhaps instead make use of the redundant encodings to represent ranges of docids with the same wdf (useful for compactly representing a boolean term which indexes a lot of documents!)