Opened 9 years ago
Closed 9 years ago
#686 closed defect (fixed)
Backend support for >= 2**32 documents
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.4 |
Component: | Backend-Glass | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
Now that #385 is fixed, xapian-core can be built with 64-bit document ids, but the current disk-based backends use encodings which don't allow document ids >= 2**32
. The API support is still very useful, as it means you can search multiple database together without exhausting the document id space, but it ought to be possible to do this within a single database too.
Marking as 1.3.x and glass, at least for now, as it's probably just a simple encoding tweak in a small number of places.
Change History (6)
comment:1 by , 9 years ago
Description: | modified (diff) |
---|
comment:2 by , 9 years ago
comment:3 by , 9 years ago
And I have a new encoding for pack_uint_preserving_sort()
which extends up to 64 bits nicely. For 32-bit values, it takes one fewer byte to encode 0x4000
-0x7fff
, but one more byte to encode 0x20000000
-0x3fffffff
. Overall that's probably a win for most people, with a small penalty if you have much more than half a billion documents, and naturally extends us to 64 bits which seems a sane trade-off.
The obvious tweak to the current encoding which would allow 64 bits is to use 3 bits for the width instead of 2, and that is always the same or worse than my new encoding until you have more than 144 million billion documents.
The basic idea is that the first byte of the encoding is a run of 0 to 7 1
bits, followed by a 0
bit. Any remaining bits store the most significant bits of the value. The number of 1
bits in the run is one fewer than the number of bytes of value which follow (so the encoding is at least 2 bytes long, and at most 9 (or at most 5 for 32-bit values).
comment:4 by , 9 years ago
I wrote a script to compare the encodings.
now
is the current encoding, shown for comparisonnow3
is the current encoded extended to use a 3 bit length.unary
is my proposed new encoding- the others are as described in Managing Gigabytes 2nd ed, p117 (golbx being Golomb with b=x)
The cell values are the number of bytes needed to encode a value with that many bits with the specified encoding.
The shortest values per row are bold, and the longest italic (it happens every value is either the best or worst in its row). For rows where all values are the same, bold and italic aren't used. The now
encoding is ignored when calculating the best and worst (since it can't represent the full 64-bit range).
bits | now3 | unary | gamma | delta | golb3 | golb6 | best | now |
---|---|---|---|---|---|---|---|---|
1-13 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | (2) |
14 | 3 | 2 | 2 | 2 | 2 | 3 | 2 | (2) |
15 | 3 | 2 | 2 | 2 | 3 | 3 | 2 | (3) |
16-20 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | (3) |
21 | 3 | 3 | 3 | 4 | 3 | 3 | 3 | (3) |
22 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | (3) |
23-28 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | (4) |
29 | 4 | 4 | 4 | 5 | 4 | 5 | 4 | (4) |
30-35 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | (5) |
36 | 5 | 5 | 6 | 6 | 5 | 5 | 5 | (5) |
37 | 5 | 6 | 6 | 6 | 5 | 6 | 5 | (5) |
38-43 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | (5) for 38 only |
44 | 6 | 7 | 7 | 7 | 6 | 6 | 6 | |
45 | 6 | 7 | 7 | 7 | 7 | 7 | 6 | |
46-50 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | |
51 | 7 | 8 | 7 | 7 | 7 | 7 | 7 | |
52 | 7 | 8 | 8 | 8 | 7 | 7 | 7 | |
53 | 7 | 8 | 8 | 8 | 8 | 8 | 7 | |
54-57 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | |
58 | 8 | 9 | 8 | 8 | 8 | 8 | 8 | |
59 | 8 | 9 | 8 | 8 | 8 | 8 | 8 | |
60 | 8 | 9 | 9 | 9 | 8 | 8 | 8 | |
61 | 8 | 9 | 9 | 9 | 9 | 9 | 8 | |
62-64 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
No one encoding is best everywhere, but unary seems as good as any of the other encodings in the table, and is the only one which is at least as good as the current encoding for all values currently encoded (i.e. it wouldn't add overhead to any database you can currently work with).
comment:5 by , 9 years ago
Milestone: | 1.3.x → 1.3.4 |
---|
Fixed in [32a10db6d0e60e5366180efd7675d6dc8bba9afd] for 1.3.4.
comment:6 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
[b8fcf940fb00e03bb6c2725784e72e79c1ffbabf] changed glass to allow keys to be up to 255 bytes long (rather than 252 as previously), which means there's more space in the keys without having to reduce the term length limit.