wiki:HoneyEncodings

Posting list compression

Posting bitmap

Papers for start:

Some encodings, that store bitfields(inc. Roaring and Slicing(AKA RUP?)), can do logical operations without converting to list of docids.

Hierarchical index encodings can provide optimized nextGEQ(n) primitive, that skips decompressing of docids, which are certanly below docid n.

Certain encodings rely on trees(or shall I say tries?) as main way to store list of documents:

Compressing data alongside posting bitmap

It is desired to store list of unsorted integers for ranking purposes, such as wdf. Since usually queries have AND operation as top operation, most of additional data will end up discarded, so it makes sense to use such data structure, which allows to skip decoding most of it.

  1. Partial decoding like in Stream VByte, where reading 1 byte is enough to skip 4 values
  2. Block skip by skipping entire blocks of known amount of values without reading it.

Option 1 can be made faster by turning AoS into SoA or at least tiling of control bytes. Given 2^n bytes long register, multiple of 5 * 2^n control bytes span for tiled(AoSoA) is optimal for Stream VByte, because it allows length accumulation with 1 shift and 2 and operations per control byte + n + 1 shifts, ands and adds at the end, but SoA does not waste memory bandwidth for skipped values.

Option 2 can be combined with Universe Partitioning(Roaring, Slicing, rTrie).

Mentioned above techniques can also store minimum and/or maximum of wdf and other values at their nodes as (C - 1) * b numbers plus (C - 1) * v bits, where C is number of child nodes, b is number of bounds and v is number of values. This would allow early tree-level value intersection: rejecting branches with too small or too big values(including early top-k), visit reordering.

Last modified 3 weeks ago Last modified on 04/02/25 04:37:09
Note: See TracWiki for help on using the wiki.