Opened 14 years ago

Last modified 4 months ago

#451 new enhancement

Add option to compaction to rebuild postlist chunks

Reported by: Richard Boulton
Owned by: Richard Boulton
Priority: normal
Milestone: 2.0.0
Component: Library API
Version: git master
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Currently, xapian-compact simply stitches existing chunks in the postlist and value list together. This is fast, but has two significant drawbacks:

  • If document IDs are being preserved (via the --no-renumber option), xapian-compact cannot merge databases with overlapping document ID ranges (even if no documents occur in both databases).
  • Modifications to a database can result in many small chunks; recombining these chunks into larger chunks should result in faster searches. Xapian-compact doesn't currently do this.
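
To illustrate the first drawback, here is a minimal sketch of the kind of range check that stitching implies. It is not the code in xapian-compact; `DocidRange` and `ranges_overlap` are names made up for this example, and it assumes each input database is summarised by the lowest and highest document ID it uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive range of document IDs used by one input database.
// (Hypothetical type for illustration, not a Xapian class.)
struct DocidRange {
    std::uint32_t first;
    std::uint32_t last;
};

// Returns true if any two input ranges overlap. Stitching chunks end to end
// only yields docid-ordered postlists when this returns false.
bool ranges_overlap(std::vector<DocidRange> ranges) {
    for (const DocidRange& r : ranges) {
        assert(r.first <= r.last);  // each range must be well formed
    }
    // Sort by starting docid; after that it is enough to compare neighbours.
    std::sort(ranges.begin(), ranges.end(),
              [](const DocidRange& a, const DocidRange& b) {
                  return a.first < b.first;
              });
    for (std::size_t i = 1; i < ranges.size(); ++i) {
        if (ranges[i].first <= ranges[i - 1].last) {
            return true;
        }
    }
    return false;
}
```

For example, one database holding only odd document IDs 1 to 99 and another holding only even IDs 2 to 100 share no documents, yet their ranges overlap, so they could not be stitched.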

I propose adding a new option to xapian-compact: "--rebuild-chunks", which rebuilds the postlist chunks (and also the valuelist chunks, and the document length chunks), packing them optimally, and allowing overlapping document ids. I have implemented a patch which adds this option for the chert backend, with some tests in api_compact.cc, and this seems to work well for me.
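
As a rough sketch of the idea (not the attached patch), rebuilding can be thought of as merging each term's postings from all inputs in document ID order and repacking them into full chunks. The names `rebuild_chunks` and `max_entries` are invented for this example, and it assumes a chunk is limited by entry count, whereas real chunks are limited by encoded size in bytes. It also assumes no document ID occurs in more than one input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Docid = std::uint32_t;
using Chunk = std::vector<Docid>;

// Merge the sorted posting lists for one term (one list per input database)
// and repack the result into chunks holding at most max_entries postings.
std::vector<Chunk> rebuild_chunks(const std::vector<std::vector<Docid>>& inputs,
                                  std::size_t max_entries) {
    assert(max_entries > 0);
    // Min-heap of (next docid, index of the input it came from).
    using Item = std::pair<Docid, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> pos(inputs.size(), 0);
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (!inputs[i].empty()) {
            heap.push(Item(inputs[i][0], i));
        }
    }
    std::vector<Chunk> out;
    while (!heap.empty()) {
        const Item item = heap.top();
        heap.pop();
        // Start a new chunk when there is none yet or the current one is full.
        if (out.empty() || out.back().size() == max_entries) {
            out.emplace_back();
        }
        out.back().push_back(item.first);
        // Advance the input we just consumed from.
        const std::size_t src = item.second;
        if (++pos[src] < inputs[src].size()) {
            heap.push(Item(inputs[src][pos[src]], src));
        }
    }
    return out;
}
```

Because the postings are decoded and merged rather than stitched, interleaved document ID ranges are handled naturally, at the cost of more work than simple concatenation.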

Attachments (1)

compactchunks.patch (21.4 KB) - added by Richard Boulton 14 years ago.
Patch implementing a --rebuild-chunks option to xapian-compact for chert databases


Change History (7)

by Richard Boulton, 14 years ago

Attachment: compactchunks.patch added

Patch implementing a --rebuild-chunks option to xapian-compact for chert databases

comment:1 by Olly Betts, 11 years ago

Component: Other → Library API
Milestone: 1.2.x → 1.3.2
Summary: Add option to xapian-compact to rebuild postlist chunks → Add option to compaction to rebuild postlist chunks

There's a patch, so we should update it and sort out getting it applied for a 1.3.x snapshot - marking for 1.3.2 for now. It'll need updating for the changes to make compaction an API though.

comment:2 by Olly Betts, 10 years ago

Milestone: 1.3.2 → 1.3.3

comment:3 by Olly Betts, 9 years ago

Milestone: 1.3.3 → 1.3.4

comment:4 by Olly Betts, 8 years ago

Milestone: 1.3.4 → 1.4.x

I'm about to rework the API to compaction, which will allow adding new flags without an ABI break, so there isn't a reason to hold up 1.4.0 for this.

comment:5 by Olly Betts, 4 years ago

Milestone: 1.4.x → 1.5.0
Version: SVN trunk → git master

> If document IDs are being preserved (via the --no-renumber option), xapian-compact cannot merge databases with overlapping document ID ranges (even if no documents occur in both databases).

I wonder if this one is really an unreasonable limitation. Nobody's complained about it since that I can recall. Did you have a use case for it?

> Modifications to a database can result in many small chunks; recombining these chunks into larger chunks should result in faster searches. Xapian-compact doesn't currently do this.

Ideally these would get combined in the normal course of operations, but even then there's still the case of merging several databases and a term occurring a small number of times in each - then we potentially have one small postlist chunk per input database.

217a67f792a93ceb085749c42a66c8829f1a9573 improves this for honey on git master - now adjacent input chunks are spliced together until doing so would exceed HONEY_POSTLIST_CHUNK_MAX. We don't try to split input chunks currently so it's not a full version of what's proposed here, but this splicing can be done without decoding so it's faster.
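
A simplified sketch of that splicing strategy follows. It is not the honey code: `splice_chunks` and `max_bytes` are hypothetical names, chunks are treated as opaque byte strings that can be joined by plain concatenation, and the limit is passed in as a parameter rather than using the real value of HONEY_POSTLIST_CHUNK_MAX.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Greedily join adjacent encoded chunks while the combined size stays within
// max_bytes. Chunks are never decoded or split, so a chunk that is already
// larger than max_bytes is passed through unchanged.
std::vector<std::string> splice_chunks(const std::vector<std::string>& chunks,
                                       std::size_t max_bytes) {
    assert(max_bytes > 0);
    std::vector<std::string> out;
    for (const std::string& chunk : chunks) {
        if (!out.empty() && out.back().size() + chunk.size() <= max_bytes) {
            out.back() += chunk;   // still fits: splice onto the previous chunk
        } else {
            out.push_back(chunk);  // would exceed the limit: start a new chunk
        }
    }
    return out;
}
```

With a limit of 8 bytes, input chunks of 3, 3, 3, 10 and 2 bytes come out as chunks of 6, 3, 10 and 2 bytes. The result is not optimally packed, which is the gap between this approach and a full rebuild.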

At this point I don't think we'd do this for glass or 1.4.x, but rather for honey in the next release series.

comment:6 by Olly Betts, 4 months ago

Milestone: 1.5.0 → 2.0.0