Posting list compression
Posting bitmap
Papers for start:
- An Experimental Study of Bitmap Compression vs. Inverted List Compression
- Techniques for Inverted Index Compression
Some encodings, that store bitfields(inc. Roaring and Slicing(AKA RUP?)), can do logical operations without converting to list of docids.
Hierarchical index encodings can provide optimized nextGEQ(n)
primitive, that skips decompressing of docids, which are certanly below docid n
.
Certain encodings rely on trees(or shall I say tries?) as main way to store list of documents:
Compressing data alongside posting bitmap
It is desired to store list of unsorted integers for ranking purposes, such as wdf. Since usually queries have AND operation as top operation, most of additional data will end up discarded, so it makes sense to use such data structure, which allows to skip decoding most of it.
- Partial decoding like in Stream VByte, where reading 1 byte is enough to skip 4 values
- Block skip by skipping entire blocks of known amount of values without reading it.
Option 1 can be made faster by turning AoS into SoA or at least tiling of control bytes. Given 2^n
bytes long register, multiple of 5 * 2^n
control bytes span for tiled(AoSoA) is optimal for Stream VByte, because it allows length accumulation with 1 shift and 2 and operations per control byte + n + 1
shifts, ands and adds at the end, but SoA does not waste memory bandwidth for skipped values.
Option 2 can be combined with Universe Partitioning(Roaring, Slicing, rTrie).
Mentioned above techniques can also store minimum and/or maximum of wdf and other values at their nodes as (C - 1) * b
numbers plus (C - 1) * v
bits, where C
is number of child nodes, b
is number of bounds and v
is number of values. This would allow early tree-level value intersection: rejecting branches with too small or too big values(including early top-k), visit reordering.