Opened 2 years ago
Closed 2 years ago
#819 closed enhancement (fixed)
What is the impact of the block_size parameter in the Database::compact method?
Reported by: | mgautier | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.4.21 |
Component: | Documentation | Version: | |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
It is unclear to me how the block_size parameter will impact the database.
For a bit of context: we index content once, compact the database (with the Compactor::FULLER flag), and then distribute the read-only result, which is what we search against.
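For concreteness, our compaction step looks roughly like the sketch below (the paths and the explicit 8192 are just placeholders; I'm assuming the Database::compact overload that takes block_size as its third argument):

```cpp
#include <xapian.h>

int main() {
    // Open the database produced by the indexing run.
    Xapian::Database db("index_data");

    // Compact at the FULLER level; the third argument is the block_size
    // this ticket is asking about (passing 0 would mean "use the default").
    db.compact("index_compacted", Xapian::Compactor::FULLER, 8192);
    return 0;
}
```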
What would be the impact of the block_size? Does it have an impact on the database size, on search performance, or both?
It may be good to add this information to the documentation (that way my ticket is really an enhancement and not a simple question :) )
Thanks
Change History (3)
comment:1 by , 2 years ago
comment:2 by , 2 years ago
Milestone: | → 1.4.21 |
---|---|
Status: | new → assigned |
Setting milestone - I'll try to at least slot in text based on my reply above.
comment:3 by , 2 years ago
Component: | Other → Documentation |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
Updated the "getting started" guide, and merged to the copy of this document in the xapian repo in 72a282c84759.
Backported for 1.4.20 in ab9fd7e03de70bb40e2d9906e46578b2e9827bed.
It's a good question, but I don't think we really have a clear answer to add to the documentation.
https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/admin_notes.html#databases discusses this a bit:
There are a couple of points we could probably also mention there.
Making the blocksize a multiple of (or the same as) both the sector size of the device and the blocksize of the filing system the database is on is almost certainly a good plan, but sector size seems to always be 4K or less (https://en.wikipedia.org/wiki/Disk_sector) and FS block size still seems to be 4K by default (the widely used ext4 potentially supports up to 64K, but only up to the system page size, which is 4K on e.g. x86 and x86-64). So in practice this typically isn't actually going to be a consideration.
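If you do want to check the filesystem block size for the directory your database is on, something along these lines using POSIX statvfs() would show it (the path is a placeholder):

```cpp
#include <sys/statvfs.h>
#include <cstdio>

int main() {
    struct statvfs vfs;
    // Query the filesystem the database directory is on.
    if (statvfs("index_data", &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    // f_bsize is the filesystem's preferred I/O block size
    // (typically 4096 on ext4).
    std::printf("filesystem block size: %lu bytes\n",
                static_cast<unsigned long>(vfs.f_bsize));
    return 0;
}
```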
The main benefits a larger blocksize gives are slightly more efficient packing plus reduced total per-block overheads (and the additional gains here are likely to be smaller for each extra block size doubling), while the downside is needing to read/write more data to read/write a single block. The extra data is at least contiguous (at least in file offset terms - maybe not always on disk) but there are potentially significant negative factors like added pressure on the drive cache and OS file cache. The additional losses are likely to grow for each extra block size doubling.
In general, for most people just using the default block size is sensible. It's something you might tune when you either care about reducing size more than anything else, or are prepared to profile your complete system with different block sizes to see what works best for your own situation.
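As a starting point for that sort of profiling, a rough sketch might compact the same source database at each candidate block size and compare the resulting on-disk sizes (paths are placeholders, and you'd still need to time searches with a representative query load on top of this):

```cpp
#include <xapian.h>
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <string>

int main() {
    Xapian::Database db("index_data");
    // Candidate block sizes (powers of two).
    for (int block_size : {2048, 4096, 8192, 16384, 32768, 65536}) {
        std::string out = "index_bs" + std::to_string(block_size);
        db.compact(out, Xapian::Compactor::FULLER, block_size);

        // Total up the size of the compacted database's files.
        std::uintmax_t total = 0;
        for (const auto& entry :
             std::filesystem::recursive_directory_iterator(out)) {
            if (entry.is_regular_file())
                total += entry.file_size();
        }
        std::cout << block_size << " byte blocks -> "
                  << total << " bytes on disk\n";
    }
    return 0;
}
```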
BTW, if you're creating a read-only database, using the single-file glass format is worth considering. It's not going to save you disk space (beyond saving a few inodes), but it means only one file needs to be opened to open the database, which reduces initialisation overhead a little, and a single file is more convenient if you need to copy it around. You can even embed the database in another file, so you can ship a single file containing both the content and a Xapian database which provides a search of it.
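For example, something like this (paths are placeholders):

```cpp
#include <xapian.h>

int main() {
    Xapian::Database db("index_data");

    // Write the compacted database as a single file rather than a directory.
    db.compact("index.glass",
               Xapian::DBCOMPACT_SINGLE_FILE | Xapian::Compactor::FULLER);

    // The resulting single file can then be opened directly for searching.
    Xapian::Database readonly("index.glass");
    return 0;
}
```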