#819 closed enhancement (fixed)

What is the impact of the block_size parameter in the Database::compact method?

Reported by: mgautier
Owned by: Olly Betts
Priority: normal
Milestone: 1.4.21
Component: Documentation
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

It is unclear to me how the block_size parameter will impact the database.

For a bit of context, we index content once, compact the database (with the Compactor::FULLER flag), and then distribute the read-only database, which is what we search.

What would be the impact of the block_size? Does it have an impact on the database size, on search performance, or both?

It may be good to add this information to the documentation (this way my ticket is really an enhancement and not a simple question :) )

Thanks

Change History (3)

comment:1 by Olly Betts, 20 months ago

It's a good question, but I don't think we really have a clear answer to add to the documentation.

https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/admin_notes.html#databases discusses this a bit:

> The .glass file actually stores the data, and is structured as a tree of blocks, which have a default size of 8KB (though this can be set, either through the Xapian API, or with some of the tools detailed later in this document).
>
> Changing the blocksize may have performance implications, but it is hard to know whether these will be positive or negative for a particular combination of hardware and software without doing some profiling.

There are a couple of points we could probably also mention there.

Making the blocksize a multiple of (or the same as) both the sector size of the device and the blocksize of the filing system which the database is on is almost certainly a good plan. However, sector size seems to always be 4K or less (https://en.wikipedia.org/wiki/Disk_sector) and FS block size still seems to be 4K by default (the widely used ext4 potentially supports up to 64K, but only up to the system page size, which is 4K on e.g. x86 and x86-64). So in practice this is typically not going to be a consideration.
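The alignment rule above can be checked mechanically. The following is only a minimal sketch and not part of Xapian: the helper names `is_valid_glass_block_size` and `is_aligned` are hypothetical, and the sector and FS block sizes are passed in as plain numbers (assumed to be 4K here) rather than queried from the OS. It also assumes the documented constraint that a glass block size must be a power of two between 2K and 64K.

```python
def is_valid_glass_block_size(n: int) -> bool:
    """True if n is a power of two between 2K and 64K inclusive."""
    return 2048 <= n <= 65536 and (n & (n - 1)) == 0


def is_aligned(block_size: int, sector_size: int, fs_block_size: int) -> bool:
    """True if block_size is a whole multiple of both other sizes."""
    return block_size % sector_size == 0 and block_size % fs_block_size == 0


# Assumed typical values; substitute the figures for your own device
# and filing system.
SECTOR_SIZE = 4096
FS_BLOCK_SIZE = 4096

for bs in (2048, 4096, 8192, 16384, 65536):
    print(bs, is_valid_glass_block_size(bs),
          is_aligned(bs, SECTOR_SIZE, FS_BLOCK_SIZE))
```

With 4K sectors and 4K FS blocks, every valid size from 4K upwards passes the check, including the 8K default. Only 2K fails, which matches the point that alignment rarely ends up being the deciding factor.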

The main benefits a larger blocksize gives are slightly more efficient packing plus reduced total per-block overheads (and the additional gains here are likely to be smaller for each extra block size doubling), while the downside is needing to read/write more data to read/write a single block. The extra data is at least contiguous (at least in file offset terms - maybe not always on disk) but there are potentially significant negative factors like added pressure on the drive cache and OS file cache. The additional losses are likely to grow for each extra block size doubling.
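The shape of this trade-off can be illustrated with a deliberately crude model. This is a sketch under stated assumptions, not a description of how glass behaves: `PER_BLOCK_OVERHEAD` and `PAYLOAD` are made-up numbers, every block is assumed to carry the same fixed overhead, and packing efficiency and caching are ignored entirely.

```python
PER_BLOCK_OVERHEAD = 64      # hypothetical bytes of bookkeeping per block
PAYLOAD = 1_000_000_000      # hypothetical bytes of data to store


def total_overhead(block_size: int) -> int:
    """Bytes spent on per-block bookkeeping under the toy model."""
    usable = block_size - PER_BLOCK_OVERHEAD
    blocks = -(-PAYLOAD // usable)  # ceiling division
    return blocks * PER_BLOCK_OVERHEAD


def saving_from_doubling(block_size: int) -> int:
    """Overhead bytes saved by going from block_size to 2 * block_size."""
    return total_overhead(block_size) - total_overhead(2 * block_size)


for bs in (4096, 8192, 16384, 32768):
    print(f"{bs:>6} -> {2 * bs:>6}: "
          f"saves {saving_from_doubling(bs)} bytes of overhead, "
          f"a single-block read grows to {2 * bs} bytes")
```

In this model each doubling saves roughly half as much overhead as the previous one, while the amount of data moved to read or write one block doubles every time. It says nothing about the cache pressure mentioned above, which only profiling on the real system can measure.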

In general for most people just using the default block size is sensible. It's something you might tune when you either care more about reducing size over anything else, or if you're prepared to profile your complete system with different block sizes to see what works best for your own situation.

BTW, if you're creating a read-only database, using the single-file glass format is worth considering. It's not going to save you disk space (beyond saving a few inodes) but it means only one file needs to be opened to open the database so reduces initialisation overhead a little, and a single file is more convenient if you need to copy it around. You can even embed the database in another file so you can ship a single file containing content and a Xapian database which provides a search of it.

comment:2 by Olly Betts, 20 months ago

Milestone: 1.4.21
Status: new → assigned

Setting milestone - I'll try to at least slot in text based on my reply above.

comment:3 by Olly Betts, 19 months ago

Component: Other → Documentation
Resolution: fixed
Status: assigned → closed

Updated the "getting started" guide, and merged to the copy of this document in the xapian repo in 72a282c84759.

Backported for 1.4.20 in ab9fd7e03de70bb40e2d9906e46578b2e9827bed.
