Opened 9 years ago

Closed 8 years ago

#686 closed defect (fixed)

Backend support for >= 2**32 documents

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 1.3.4
Component: Backend-Glass Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Now that #385 is fixed, xapian-core can be built with 64-bit document ids, but the current disk-based backends use encodings which don't allow document ids >= 2**32. The API support is still very useful, as it means you can search multiple database together without exhausting the document id space, but it ought to be possible to do this within a single database too.

Marking as 1.3.x and glass, at least for now, as it's probably just a simple encoding tweak in a small number of places.

Change History (6)

comment:1 by Olly Betts, 9 years ago

Description: modified (diff)

comment:2 by Olly Betts, 8 years ago

[b8fcf940fb00e03bb6c2725784e72e79c1ffbabf] changed glass to allow keys to be up to 255 bytes long (rather than 252 as previously), which means there's more space in the keys without having to reduce the term length limit.

comment:3 by Olly Betts, 8 years ago

And I have a new encoding for pack_uint_preserving_sort() which extends up to 64 bits nicely. For 32-bit values, it takes one fewer byte to encode 0x4000-0x7fff, but one more byte to encode 0x20000000-0x3fffffff. Overall that's probably a win for most people, with a small penalty if you have much more than half a billion documents, and naturally extends us to 64 bits which seems a sane trade-off.

The obvious tweak to the current encoding which would allow 64 bits is to use 3 bits for the width instead of 2, and that is always the same or worse than my new encoding until you have more than 144 million billion documents.

The basic idea is that the first byte of the encoding is a run of 0 to 7 1 bits, followed by a 0 bit. Any remaining bits store the most significant bits of the value. The number of 1 bits in the run is one fewer than the number of bytes of value which follow (so the encoding is at least 2 bytes long, and at most 9 (or at most 5 for 32-bit values).

Last edited 8 years ago by Olly Betts (previous) (diff)

comment:4 by Olly Betts, 8 years ago

I wrote a script to compare the encodings.

  • now is the current encoding, shown for comparison
  • now3 is the current encoded extended to use a 3 bit length.
  • unary is my proposed new encoding
  • the others are as described in Managing Gigabytes 2nd ed, p117 (golbx being Golomb with b=x)

The cell values are the number of bytes needed to encode a value with that many bits with the specified encoding.

The shortest values per row are bold, and the longest italic (it happens every value is either the best or worst in its row). For rows where all values are the same, bold and italic aren't used. The now encoding is ignored when calculating the best and worst (since it can't represent the full 64-bit range).

bits now3 unarygammadeltagolb3golb6bestnow
1-13 2 2 2 2 2 2 2(2)
14 3 22223 2(2)
15 3 2223 3 2(3)
16-20 3 3 3 3 3 3 3(3)
21 3334 333(3)
22 4 34 4 4 4 3(3)
23-28 4 4 4 4 4 4 4(4)
29 4445 45 4(4)
30-35 5 5 5 5 5 5 5(5)
36 556 6 555(5)
37 56 6 6 56 5(5)
38-43 6 6 6 6 6 6 6(5) for 38 only
44 67 7 7 666
45 67 7 7 7 7 6
46-50 7 7 7 7 7 7 7
51 78 77777
52 78 8 8 777
53 78 8 8 8 8 7
54-57 8 8 8 8 8 8 8
58 89 88888
59 89 88888
60 89 9 9 888
61 89 9 9 9 9 8
62-64 9 9 9 9 9 9 9

No one encoding is best everywhere, but unary seems as good as any of the other encodings in the table, and is the only one which is at least as good as the current encoding for all values currently encoded (i.e. it wouldn't add overhead to any database you can currently work with).

comment:5 by Olly Betts, 8 years ago

Milestone: 1.3.x1.3.4

comment:6 by Olly Betts, 8 years ago

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.