| 19 | | The tag is formed exactly as in Quartz, but may be stored compressed by zlib, which gives a |
| | 21 | If the document isn't indexed by any terms, then the tag is empty. |
| | 22 | |
| | 23 | Otherwise, the tag starts: |
| | 24 | |
| | 25 | {{{ |
| | 26 | pack_uint(doclen) + |
| | 27 | pack_uint(termlist_size) |
| | 28 | }}} |
| | 29 | |
| | 30 | Optionally followed by a {{{'0'}}} character (only required if the first term is |
| | 31 | 48 characters long!) |
| | 32 | |
| | 33 | Followed by: |
| | 34 | |
| | 35 | {{{ |
| | 36 | char(term.size()) + |
| | 37 | term + |
| | 38 | pack_uint(wdf) |
| | 39 | }}} |
| | 40 | |
| | 41 | Followed by zero or more entries, each of which can take one of two forms |
| | 42 | (the first is used when {{{packed < 256}}}): |
| | 43 | |
| | 44 | {{{ |
| | 45 | char(packed) + |
| | 46 | char(term.size() - reuse) + |
| | 47 | term.substr(reuse) |
| | 48 | }}} |
| | 49 | |
| | 50 | {{{ |
| | 51 | char(reuse) + |
| | 52 | char(term.size() - reuse) + |
| | 53 | term.substr(reuse) + |
| | 54 | pack_uint(wdf) |
| | 55 | }}} |
| | 56 | |
| | 57 | In the above, reuse is the length of the longest common prefix between |
| | 58 | this term and the previous one, and packed is: |
| | 59 | |
| | 60 | {{{ |
| | 61 | (wdf + 1) * (prev_term.size() + 1) + reuse |
| | 62 | }}} |
| | 63 | |
| | 64 | Care is needed to avoid the above overflowing - in reality we first check |
| | 65 | that {{{wdf < 127}}} as packed will never fit in a byte if it isn't. |
| | 66 | |
| | 67 | To unpack, we know that {{{reuse <= prev_term.size()}}} |
| | 68 | if this condition doesn't hold, we know we must have the first form and unpack |
| | 69 | it thus: |
| | 70 | |
| | 71 | {{{ |
| | 72 | reuse = value % (prev_term.size() + 1); |
| | 73 | wdf = (value / (prev_term.size() + 1)) - 1; |
| | 74 | }}} |
| | 75 | |
| | 76 | The tag may be stored compressed by zlib, which gives a |
| 23 | | Quartz first stores pack_uint(document length), and the number of |
| 24 | | entries in the termlist (this value is stored for quick access - it |
| 25 | | could also be determined by running through the termlist). It then stores the |
| 26 | | entries, in sorted order. For the first term, we store: |
| 27 | | termname.length() (in a byte) then the termname, then pack_uint(wdf). |
| 28 | | For subsequent terms, we store: length of the common prefix of this termname |
| 29 | | and the previous one (in a byte), length of string to add to this prefix |
| 30 | | (in a byte) then the string to add, then pack_uint(wdf). |
| | 80 | ----- |
| 32 | | This is correct |
| 33 | | prior to 0.8.0, 0.8.0 and later will try to squeeze the wdf into the common |
| 34 | | prefix byte - we know that the prefix reuse length is <= the length of the |
| 35 | | previous term, so if this value is >, we take: |
| 36 | | reuse_length = value % (prev_term_length + 1) |
| 37 | | wdf = (value / (prev_term_length + 1)) - 1 |
| 38 | | and then we omit the byte which would otherwise store pack_uint(wdf) for this term. |
| | 82 | For chert or later, we could use a similar packing to the above, but combine |
| | 83 | {{{reuse}}} and {{{term.size() - reuse}}} into a two byte value consisting of |
| | 84 | either: |