Add option to compaction to rebuild postlist chunks

Currently, xapian-compact simply stitches existing chunks in the postlist and value list together. This is fast, but has two significant drawbacks:

  • If document IDs are being preserved (via the --no-renumber option), xapian-compact cannot merge databases with overlapping document ID ranges (even if no documents occur in both databases).
  • Modifications to a database can result in many small chunks; recombining these chunks into larger chunks should result in faster searches. Xapian-compact doesn't currently do this.
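
To illustrate the first drawback, here is a minimal sketch of the kind of check that stitching implies. It assumes each input database can be summarised by the range of document IDs it uses; `DocidRange` and `ranges_disjoint` are hypothetical names for illustration, not xapian-compact's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: each input database is summarised by the inclusive
// range [first, last] of document IDs it uses.
using DocidRange = std::pair<unsigned, unsigned>;

// Return true if no two ranges share a document ID, i.e. whole chunks from
// the inputs could be laid out in docid order without decoding any of them.
bool ranges_disjoint(std::vector<DocidRange> ranges)
{
    for (const DocidRange& r : ranges) assert(r.first <= r.second);
    // Sort by first docid; after that only neighbours need comparing.
    std::sort(ranges.begin(), ranges.end());
    for (std::size_t i = 1; i < ranges.size(); ++i) {
        if (ranges[i].first <= ranges[i - 1].second) return false;
    }
    return true;
}
```

For example, one database holding only odd document IDs 1..9 and another holding only even document IDs 2..10 share no documents, but their ranges [1, 9] and [2, 10] overlap, so a check like this rejects the merge.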

I propose adding a new option to xapian-compact: "--rebuild-chunks", which rebuilds the postlist chunks (and also the valuelist chunks, and the document length chunks), packing them optimally, and allowing overlapping document IDs. I have implemented a patch which adds this option for the chert backend, including some tests, and it seems to work well for me.
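
As a rough sketch of what rebuilding means for one term (not the code in the patch), the postings from every input can be merged in docid order and then repacked into full chunks. The model below is a simplifying assumption: a posting is a (docid, wdf) pair, the chunk limit is a number of entries rather than a byte size, and no docid occurs in more than one input. `Posting`, `Chunk` and `rebuild_chunks` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical simplified model: a posting is (docid, wdf) and a chunk is a
// sorted run of postings. Real chunks are byte-encoded and limited by size
// in bytes; here the limit is a number of entries.
using Posting = std::pair<unsigned, unsigned>;
using Chunk = std::vector<Posting>;

// Merge the (already sorted) postlists of one term from several inputs and
// repack them into chunks of at most max_entries postings each.
std::vector<Chunk>
rebuild_chunks(const std::vector<std::vector<Posting>>& inputs,
               std::size_t max_entries)
{
    assert(max_entries > 0);
    // Min-heap of (docid, input index): the next unread posting per input.
    using Head = std::pair<unsigned, std::size_t>;
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<std::size_t> pos(inputs.size(), 0);
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (!inputs[i].empty()) heap.push(Head(inputs[i][0].first, i));
    }

    std::vector<Chunk> out;
    while (!heap.empty()) {
        std::size_t i = heap.top().second;
        heap.pop();
        // Start a new chunk when there is none yet or the last one is full.
        if (out.empty() || out.back().size() == max_entries) out.emplace_back();
        out.back().push_back(inputs[i][pos[i]]);
        if (++pos[i] < inputs[i].size()) {
            heap.push(Head(inputs[i][pos[i]].first, i));
        }
    }
    return out;
}
```

Because every posting is read and rewritten, the docid ranges of the inputs are free to interleave, and every output chunk except possibly the last is full. The cost is that each chunk has to be decoded and re-encoded, which is why this would be slower than stitching.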

Patch implementing a --rebuild-chunks option to xapian-compact for chert databases

There's a patch, so we should sort out getting it applied for a 1.3.x snapshot - marking for 1.3.2 for now. It'll need updating for the changes to make compaction an API though.

I'm about to rework the API to compaction, which will allow adding new flags without an ABI break, so there isn't a reason to hold up 1.4.0 for this.

If document IDs are being preserved (via the --no-renumber option), xapian-compact cannot merge databases with overlapping document ID ranges (even if no documents occur in both databases).

I wonder if this one is really an unreasonable limitation. Nobody's complained about it since, as far as I can recall. Did you have a use case for it?

Modifications to a database can result in many small chunks; recombining these chunks into larger chunks should result in faster searches. Xapian-compact doesn't currently do this.

Ideally these would get combined in the normal course of operations, but even then there's still the case of merging several databases and a term occurring a small number of times in each - then we potentially have one small postlist chunk per input database.

217a67f792a93ceb085749c42a66c8829f1a9573 improves this for honey on git master - now adjacent input chunks are spliced together until doing so would exceed HONEY_POSTLIST_CHUNK_MAX. We don't try to split input chunks currently so it's not a full version of what's proposed here, but this splicing can be done without decoding so it's faster.
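
The splicing described here can be sketched roughly as follows. This is a simplified model rather than the honey code: each chunk is treated as an opaque byte string, any per-chunk header fix-ups are ignored, and `chunk_max` is a hypothetical parameter standing in for HONEY_POSTLIST_CHUNK_MAX.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Greedily concatenate adjacent encoded chunks while the result stays within
// chunk_max bytes. Chunks are never decoded or split, so an input chunk which
// is already larger than chunk_max is passed through unchanged.
std::vector<std::string>
splice_chunks(const std::vector<std::string>& in, std::size_t chunk_max)
{
    assert(chunk_max > 0);
    std::vector<std::string> out;
    for (const std::string& chunk : in) {
        if (!out.empty() && out.back().size() + chunk.size() <= chunk_max) {
            // Still fits: append to the chunk currently being built.
            out.back() += chunk;
        } else {
            // Would exceed the limit (or this is the first chunk): start anew.
            out.push_back(chunk);
        }
    }
    return out;
}
```

With a limit of 6 bytes, input chunks of sizes 2, 3, 4, 1 and 12 come out as chunks of sizes 5, 5 and 12. Since nothing is split, the output is not optimally packed, which is the gap between this and a full rebuild.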

At this point I don't think we'd do this for glass or 1.4.x, but rather for honey in the next release series.

comment:6 by Olly Betts, 15 months ago

