Changes between Version 7 and Version 8 of FlintTermListTable

Show
Ignore:
Timestamp:
2008-08-21 12:18:18 (2 years ago)
Author:
olly
Comment:

Update for 1.0.7

Legend:

Unmodified
Added
Removed
Modified
  • FlintTermListTable

    v7 v8  
    55== Key Format == 
    66 
    7 Quartz stores the key as: lsb ... msb of the docid, until all remaining bytes are zero 
     7{{{ 
     8pack_uint_preserving_sort(docid) 
     9}}} 
    810 
    9 Flint currently uses pack_uint_preserving_sort(docid) - this takes one extra byte compared to 
     11This takes one extra byte compared to 
    1012Quartz, but means that when appending documents to a database, the insert is always 
    1113in the same place (at the "end" of the table).  This is faster, and produces a more compact 
    1214database without a separate compaction step. 
    1315 
    14 (The eventual plan is to subclass the compare routine so we can store the key 
    15 as compactly as Quartz does but keep the improved sort order.) 
     16The plan for Chert or later is to subclass the compare routine so we can store the key 
     17as compactly as Quartz did but keep the improved sort order. 
    1618 
    1719== Tag Format == 
    1820 
    19 The tag is formed exactly as in Quartz, but may be stored compressed by zlib, which gives a 
     21If the document isn't indexed by any terms, then the tag is empty. 
     22 
     23Otherwise, the tag starts: 
     24 
     25{{{ 
     26pack_uint(doclen) + 
     27pack_uint(termlist_size) 
     28}}} 
     29 
     30Optionally followed by a {{{'0'}}} character (only required if the first term is 
     3148 characters long!) 
     32 
     33Followed by: 
     34 
     35{{{ 
     36char(term.size()) + 
     37term + 
     38pack_uint(wdf) 
     39}}} 
     40 
     41Followed by zero or more entries, each of which can take one of two forms 
     42(the first is used when {{{packed < 256}}}): 
     43 
     44{{{ 
     45char(packed) + 
     46char(term.size() - reuse) + 
     47term.substr(reuse) 
     48}}} 
     49 
     50{{{ 
     51char(reuse) + 
     52char(term.size() - reuse) + 
     53term.substr(reuse) + 
     54pack_uint(wdf) 
     55}}} 
     56 
     57In the above, reuse is the length of the longest common prefix between 
     58this term and the previous one, and packed is: 
     59 
     60{{{ 
     61(wdf + 1) * (prev_term.size() + 1) + reuse 
     62}}} 
     63 
     64Care is needed to avoid the above overflowing - in reality we first check 
     65that {{{wdf < 127}}} as packed will never fit in a byte if it isn't. 
     66 
     67To unpack, we know that {{{reuse <= prev_term.size()}}} 
     68if this condition doesn't hold, we know we must have the first form and unpack 
     69it thus: 
     70 
     71{{{ 
     72reuse = value % (prev_term.size() + 1); 
     73wdf = (value / (prev_term.size() + 1)) - 1; 
     74}}} 
     75 
     76The tag may be stored compressed by zlib, which gives a 
    2077decent amount of extra compression.  Very small tags (less than the minimum size 
    2178of compressed output) are stored uncompressed. 
    2279 
    23 Quartz first stores pack_uint(document length), and the number of 
    24 entries in the termlist (this value is stored for quick access - it 
    25 could also be determined by running through the termlist).  It then stores the 
    26 entries, in sorted order.  For the first term, we store: 
    27 termname.length() (in a byte) then the termname, then pack_uint(wdf). 
    28 For subsequent terms, we store: length of the common prefix of this termname 
    29 and the previous one (in a byte), length of string to add to this prefix 
    30 (in a byte) then the string to add, then pack_uint(wdf). 
     80----- 
    3181 
    32 This is correct 
    33 prior to 0.8.0, 0.8.0 and later will try to squeeze the wdf into the common 
    34 prefix byte - we know that the prefix reuse length is <= the length of the 
    35 previous term, so if this value is >, we take: 
    36 reuse_length = value % (prev_term_length + 1) 
    37 wdf = (value / (prev_term_length + 1)) - 1 
    38 and then we omit the byte which would otherwise store pack_uint(wdf) for this term. 
     82For chert or later, we could use a similar packing to the above, but combine 
     83{{{reuse}}} and {{{term.size() - reuse}}} into a two byte value consisting of 
     84either: 
    3985 
    40 The proposed change is to use a similar packing to that used by quartz in 0.8.0 and 
    41 later, but to combine the reuse_length and 
    42 add_length into a two byte value consisting of either: 
     86{{{ 
     87reuse + (prev_term.size() + 1) * (term.size() - reuse) 
     88}}} 
    4389 
    44  * reuse_length + (prev_term_length + 1) * (add_length) 
    45  
    46  * reuse_length + (prev_term_length + 1) * add_length + (prev_term_length + 1) * (256 - prev_term_length) * wdf 
     90{{{ 
     91reuse + (prev_term.size() + 1) * (term.size() - reuse) + (prev_term.size() + 1) * (256 - prev_term.size()) * wdf 
     92}}} 
    4793 
    4894This will allow the wdf byte to be omitted more often.