Opened 15 years ago
Last modified 13 months ago
#451 new enhancement
Add option to compaction to rebuild postlist chunks
Reported by: | Richard Boulton | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Library API | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
Currently, xapian-compact simply stitches existing chunks in the postlist and value list together. This is fast, but has two significant drawbacks:
- If document IDs are being preserved (via the --no-renumber option), xapian-compact cannot merge databases with overlapping document ID ranges (even if no documents occur in both databases).
- Modifications to a database can result in many small chunks; recombining these chunks into larger chunks should result in faster searches. Xapian-compact doesn't currently do this.
I propose adding a new option to xapian-compact: "--rebuild-chunks", which rebuilds the postlist chunks (and also the valuelist chunks, and the document length chunks), packing them optimally, and allowing overlapping document ids. I have implemented a patch which adds this option for the chert backend, with some tests in api_compact.cc, and this seems to work well for me.
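To illustrate the idea only, here is a minimal sketch of rebuilding one term's postlist from several inputs. It is not the attached patch. The `Posting` and `Chunk` types, the `rebuild_chunks` name, and the entry-count limit are all assumptions made for the example; real chert chunks are byte-encoded and bounded by encoded size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical simplified model: a posting is a (docid, wdf) pair and a
// chunk is a docid-sorted run of postings.
using Posting = std::pair<std::uint32_t, std::uint32_t>;
using Chunk = std::vector<Posting>;

// Merge one term's chunks from several input databases and repack them.
// Throws if any docid appears twice, since that is a genuine conflict
// rather than a mere overlap of document ID ranges.
std::vector<Chunk> rebuild_chunks(const std::vector<std::vector<Chunk>>& inputs,
                                  std::size_t max_entries) {
    assert(max_entries > 0);

    // Decode step: flatten every input chunk into one list of postings.
    std::vector<Posting> all;
    for (const auto& db : inputs)
        for (const auto& chunk : db)
            all.insert(all.end(), chunk.begin(), chunk.end());

    // Sorting by docid is what allows interleaved ID ranges to be combined.
    std::sort(all.begin(), all.end());
    for (std::size_t i = 1; i < all.size(); ++i)
        if (all[i].first == all[i - 1].first)
            throw std::invalid_argument("duplicate docid in inputs");

    // Repack step: fill each output chunk completely before starting the next.
    std::vector<Chunk> out;
    for (std::size_t i = 0; i < all.size(); i += max_entries) {
        const std::size_t end = std::min(i + max_entries, all.size());
        out.emplace_back(all.begin() + static_cast<std::ptrdiff_t>(i),
                         all.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return out;
}
```

With one input holding docids 1 and 5 and another holding 2 and 4, the ID ranges overlap but no document is shared, so the merge succeeds and the output is packed into full chunks. The cost is that every posting has to be decoded and re-encoded, which is why this would be slower than stitching.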
Attachments (1)
Change History (7)
by , 15 years ago
Attachment: | compactchunks.patch added |
---|
comment:1 by , 12 years ago
Component: | Other → Library API |
---|---|
Milestone: | 1.2.x → 1.3.2 |
Summary: | Add option to xapian-compact to rebuild postlist chunks → Add option to compaction to rebuild postlist chunks |
There's a patch, so we should update it and sort out getting it applied for a 1.3.x snapshot - marking for 1.3.2 for now. It'll need updating for the changes to make compaction an API though.
comment:2 by , 11 years ago
Milestone: | 1.3.2 → 1.3.3 |
---|
comment:3 by , 10 years ago
Milestone: | 1.3.3 → 1.3.4 |
---|
comment:4 by , 9 years ago
Milestone: | 1.3.4 → 1.4.x |
---|
I'm about to rework the API to compaction, which will allow adding new flags without an ABI break, so there isn't a reason to hold up 1.4.0 for this.
comment:5 by , 5 years ago
Milestone: | 1.4.x → 1.5.0 |
---|---|
Version: | SVN trunk → git master |
> If document IDs are being preserved (via the --no-renumber option), xapian-compact cannot merge databases with overlapping document ID ranges (even if no documents occur in both databases).
I wonder if this one is really an unreasonable limitation. Nobody's complained about it since that I can recall. Did you have a use case for it?
> Modifications to a database can result in many small chunks; recombining these chunks into larger chunks should result in faster searches. Xapian-compact doesn't currently do this.
Ideally these would get combined in the normal course of operations, but even then there's still the case of merging several databases and a term occurring a small number of times in each - then we potentially have one small postlist chunk per input database.
217a67f792a93ceb085749c42a66c8829f1a9573 improves this for honey on git master - now adjacent input chunks are spliced together until doing so would exceed HONEY_POSTLIST_CHUNK_MAX. We don't try to split input chunks currently so it's not a full version of what's proposed here, but this splicing can be done without decoding so it's faster.
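The splicing behaviour described above can be sketched roughly as follows. This is a simplified model, not the code from that commit: chunks are treated as opaque byte strings, `splice_chunks` and its `max_bytes` parameter are hypothetical names standing in for the logic around HONEY_POSTLIST_CHUNK_MAX, and any header fix-ups a real chunk format needs when two chunks are joined are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splice adjacent encoded chunks while the result stays within max_bytes.
// Chunks are never decoded or split, so an input chunk that is already
// larger than max_bytes passes through unchanged.
std::vector<std::string> splice_chunks(const std::vector<std::string>& in,
                                       std::size_t max_bytes) {
    assert(max_bytes > 0);
    std::vector<std::string> out;
    for (const std::string& chunk : in) {
        if (!out.empty() && out.back().size() + chunk.size() <= max_bytes) {
            out.back() += chunk;   // cheap append, no decoding needed
        } else {
            out.push_back(chunk);  // would exceed the limit: start a new chunk
        }
    }
    return out;
}
```

Because the loop only ever appends whole input chunks, the output can still contain chunks well under the limit (and oversized ones are left alone), which is the gap between this and a full rebuild.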
At this point I don't think we'd do this for glass or 1.4.x, but rather for honey in the next release series.
comment:6 by , 13 months ago
Milestone: | 1.5.0 → 2.0.0 |
---|
Patch implementing a --rebuild-chunks option to xapian-compact for chert databases