What size should I expect my Xapian database to be?

Typically the Xapian database will be somewhat smaller than the source data (for the glass backend in Xapian 1.4.x), but this can vary quite substantially depending on the nature of the source data and how you are indexing it.

If you don't index positional information for terms, your database will probably be about 2/3 the size. If you index a lot of document values, that will of course add to the database size, as will synonyms and spelling data.

If your database is turning out a lot bigger than you expect, you may find that running it through xapian-compact produces a much smaller database. This tool minimises the unused space in every database block and eliminates unused blocks.
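To judge whether compaction helped, it is enough to compare the on-disk size of the database directory before and after. The following is a minimal sketch that only walks the filesystem and does not use the Xapian API; the helper names are made up for illustration.

```python
import os

def database_size_bytes(db_dir):
    """Sum the sizes of all regular files under a database directory."""
    total = 0
    for root, _dirs, files in os.walk(db_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def size_ratio(before_dir, after_dir):
    """Return after/before size (hypothetical helper); below 1.0 means smaller."""
    before = database_size_bytes(before_dir)
    if before == 0:
        raise ValueError("original database directory is empty")
    return database_size_bytes(after_dir) / before
```

For example, after compacting a database in a directory called db into db-compact (both names are placeholders), size_ratio("db", "db-compact") gives the fraction of the original size that remains.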

The Btree implementation has two update modes: linear and non-linear. In linear mode, each block is filled as full as it will go before the next block is started, so blocks end up nearly 100% used (apart from the last block updated in a run of linear updates). In non-linear mode, blocks fill until an update won't fit in the remaining space in the block where it needs to go, at which point that block is split into two blocks, each ~50% full, so on average non-linear update results in ~75% use of blocks. This is actually how a B-tree achieves its amortised update cost.
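The difference between the two modes can be seen in a toy model. This is only a sketch and not Xapian's Btree code: blocks are plain Python lists, every item has the same size, and the capacity of 64 items per block is an arbitrary assumption.

```python
import bisect
import random

CAPACITY = 64  # items per block; an arbitrary figure for this toy model

def linear_fill(n_keys, capacity=CAPACITY):
    """Append ascending keys, starting a new block only when the last is full."""
    blocks = [0]  # item count per block
    for _ in range(n_keys):
        if blocks[-1] == capacity:
            blocks.append(0)
        blocks[-1] += 1
    return n_keys / (len(blocks) * capacity)

def nonlinear_fill(keys, capacity=CAPACITY):
    """Insert keys in the given order, splitting an overflowing block in two."""
    blocks = [[]]            # each block is a sorted list of keys
    lows = [float("-inf")]   # lows[i] is the smallest key block i may hold
    for key in keys:
        i = bisect.bisect_right(lows, key) - 1   # block whose range covers key
        bisect.insort(blocks[i], key)
        if len(blocks[i]) > capacity:
            mid = len(blocks[i]) // 2
            right = blocks[i][mid:]
            blocks[i] = blocks[i][:mid]          # left half stays in place
            blocks.insert(i + 1, right)          # right half becomes a new block
            lows.insert(i + 1, right[0])
    return len(keys) / (len(blocks) * capacity)

random.seed(0)
keys = list(range(20000))
random.shuffle(keys)
print(f"linear:     {linear_fill(len(keys)):.0%}")
print(f"non-linear: {nonlinear_fill(keys):.0%}")
```

In this model the linear run should come out close to 100% and the random-order run somewhere between the 50% and 100% extremes. The exact non-linear figure depends on the model, so treat it as an illustration of the rule of thumb rather than a measurement of a real database.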

If you're just appending documents with add_document(), then for most tables you'll only get linear update, and they end up pretty compact. The exceptions are the postlist and position tables: the first commit will be entirely linear, and you may get some linear update during subsequent flushes, but there's inevitably going to be non-linear update. If the flush threshold is larger, you'll increase the scope for (and size of) spells of linear updating.
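One way to raise the flush threshold without changing the indexer is the XAPIAN_FLUSH_THRESHOLD environment variable, which gives the number of documents batched up before an automatic flush. The sketch below only builds an environment for a child process; the helper name and the index.py script are placeholders, and the figure of 100000 is an arbitrary example.

```python
import os

def env_with_flush_threshold(documents, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set."""
    if documents <= 0:
        raise ValueError("threshold must be a positive document count")
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(documents)
    return env

# Hypothetical usage: run an indexer script with a larger threshold.
# import subprocess
# subprocess.run(["python3", "index.py"],
#                env=env_with_flush_threshold(100000), check=True)
```

The variable needs to be set before the writable database is opened, which is why the sketch passes it to a fresh process. A larger threshold also means more changes are held in memory between flushes.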

Also, to allow the current revision of the database to be searched while updates are being made, blocks to be modified are copied to a new block first, then when the new revision is made live, the old versions of the modified blocks are marked as unused.
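The effect of this copy-before-modify scheme on file size can be sketched with a toy block file. This is not Xapian's on-disk format: the class and its fields are invented for illustration, and the reuse of freed blocks by later revisions is an assumption of the model.

```python
class ToyBlockFile:
    """Toy copy-on-write block file; not Xapian's actual on-disk format."""

    def __init__(self, n_blocks):
        self.physical = {i: i for i in range(n_blocks)}  # logical -> physical
        self.file_blocks = n_blocks   # blocks the file occupies on disk
        self.free = []                # physical blocks available for reuse
        self.superseded = []          # old copies the live revision still needs
        self.dirty = set()            # logical blocks already copied this revision

    def modify(self, logical):
        """Copy a block before its first modification in a revision."""
        if logical in self.dirty:
            return  # already working on a private copy
        if self.free:
            new = self.free.pop()      # reuse a block freed by an earlier commit
        else:
            new = self.file_blocks     # otherwise the file has to grow
            self.file_blocks += 1
        self.superseded.append(self.physical[logical])
        self.physical[logical] = new
        self.dirty.add(logical)

    def commit(self):
        """Make the new revision live and release the superseded copies."""
        self.free.extend(self.superseded)
        self.superseded = []
        self.dirty = set()

bf = ToyBlockFile(10)
for block in (1, 3, 5, 7):
    bf.modify(block)
bf.commit()
print(bf.file_blocks, len(bf.free))   # 14 4
for block in (0, 2, 4, 6):
    bf.modify(block)
bf.commit()
print(bf.file_blocks, len(bf.free))   # 14 4
```

In the toy model the file grows from 10 to 14 blocks on the first update and then stays there, with 4 blocks unused at any time. That slack is the kind of space a compaction pass can reclaim.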

The development backend "honey" in git master produces much smaller databases for the same data than glass does (typically 30-40% smaller). It doesn't yet support update, but if you do a lot of searching compared to the amount of indexing you may want to consider using honey for searching already. Work on supporting updatable honey databases is in progress.

And in general, the database size is likely to reduce with newer releases as we come up with and implement more compact ways to store the data.


Last modified on 28/02/19 22:56:30