Opened 10 years ago
Closed 9 years ago
#671 closed defect (fixed)
Performance issues when querying over large number of local databases (shards)
Reported by: | Will Greenberg | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.5 |
Component: | Matcher | Version: | |
Severity: | normal | Keywords: | sharding performance |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | Linux |
Description
Xapian seems to struggle when performing queries over a large(ish) number of local databases. For my tests, I used a database of ~800,000 documents with ~920,000 terms total sharded among 100 separate xapian databases. These were then "joined" in a single stub database file, and queries were performed by opening that stub database.
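For context, a Xapian stub database is just a plain-text file listing the member databases, one per line; a minimal sketch with illustrative paths:

```
auto /srv/search/shard00
auto /srv/search/shard01
auto /srv/search/shard02
```

Opening that file (e.g. Xapian::Database db("/srv/search/stub");) then searches every listed shard as one database.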
The following metrics were gathered on linux 3.13, xapian 1.3.2, on a 7200rpm spinning disk HDD.
I found that calling Enquire::get_mset for single-term queries on this sharded database would take 1300-1600ms to complete, sometimes more. A query on a non-sharded database containing the same data (same # of docs and terms) usually takes <300ms to get the same mset. Running strace on the process doing the queries showed that nearly all of this time was spent waiting for pread64(2) to read a single block of each of the 100 shards' postlist files[1]. This read occurs when MultiMatch is accumulating stats on all of its submatches, and requests termfreqs on each postlist.
Subsequent queries of the same single term would result in times closer to 100ms, implying that these pread64 delays are mitigated when the VFS cache has entries for the postlist files. Working based on this assumption, I tried out a pretty gross patch[2] which calls readahead(2) on the desired postlist blocks well before they're read. The results were significant: Enquire::get_mset for queries on the sharded database with the patched xapian took ~600ms. Presumably, other bad-but-predictable IO paths can be mitigated in the same way (e.g. record.DB reads from Document::get_data).
There are probably other ways of addressing this issue without using readahead(2) or its more portable cousin, posix_fadvise(2). I personally think that the memory hit from priming the cache is acceptable, but I'd be interested to know what others think.
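For readers unfamiliar with the syscalls being discussed, here's a minimal cache-priming sketch (not the attached patch; fd, block_no, and block_size stand in for the table's real state):

```cpp
#include <fcntl.h>   // posix_fadvise(), POSIX_FADV_WILLNEED
#include <cstddef>

// Hint the kernel that one B-tree block will be wanted soon, so the
// eventual pread() hits the page cache instead of waiting on the disk.
static void prime_block(int fd, unsigned block_no, size_t block_size)
{
    off_t offset = static_cast<off_t>(block_no) * block_size;
    // Non-blocking hint; the return value is ignored because this is
    // purely an optimisation.
    (void)posix_fadvise(fd, offset, block_size, POSIX_FADV_WILLNEED);
}
```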
[1] Here's a typical example strace of a sharded query that I ran through xapian's strace-analyse script: https://gist.github.com/wgreenberg/1f8264ff79815fefe5a5
[2] Super hacky patch to test the readahead theory: https://github.com/wgreenberg/xapian/commit/d8af1fde685626283826ecaf52d4ac49cebdea98
Attachments (8)
Change History (26)
by , 10 years ago
Attachment: | ordered_strace.log added |
---|
strace log from gist
comment:1 by , 10 years ago
Description: | modified (diff) |
---|
by , 10 years ago
Attachment: | d8af1fde685626283826ecaf52d4ac49cebdea98.patch added |
---|
super hacky patch from github
comment:2 by , 10 years ago
Added readahead_perf_fix.patch which performs posix_fadvise readahead syscalls for each postlist, for each term in a multi-database query. This provides an equivalent performance boost to the hacky patch in d8af1fde685626283826ecaf52d4ac49cebdea98.patch
Not sure what the usual review process is here, but I'm sure there's a lot that needs to be fixed up in this patch before it'll be considered.
by , 10 years ago
Attachment: | readahead_perf_fix.patch added |
---|
First draft of posix_fadvise readahead fix
comment:3 by , 10 years ago
I don't think this patch is doing what you think it is.
Each table has a built-in cursor (C), which you're using here. It's used for operations which are implemented using a cursor, but for which the cursor doesn't need to live on after we return to the caller - this mostly just avoids having to create a temporary cursor for every such operation, but also has the benefit that the blocks needed may already have been loaded by a previous operation.
The problem with what you're doing is that you just use whatever is in C already. For the root block, that's fine, but once j < level we're searching for a key in whatever block of that level happens to be in the cursor. Most of the time that won't be the right block, so we'll end up on the first or last entry of the branch block, depending on which side of the right path down the tree we are. So (unless something else happens to be making sure that C points to the right place) you're pre-reading an essentially arbitrary set of blocks here for the most part.
I guess it gives a performance boost because we will want some of the blocks in that arbitrary set, and so pre-reading something is better than pre-reading nothing - we get reads for free while other stuff is going on.
But I think this should be calling block_to_cursor() while it descends the tree, and then do the pre-read at the lowest level instead (which might not necessarily be the leaf level, but there's not much point iterating after we stop reading blocks).
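To make the suggested shape concrete, here's a schematic sketch; block_to_cursor() and posix_fadvise() are the real names mentioned in this thread, but the class layout and the read_branch_and_find_child() helper are invented for illustration and are not Xapian's actual code:

```cpp
#include <fcntl.h>
#include <cstddef>
#include <cstdint>
#include <string>

// Purely illustrative table shape; only the fields the sketch needs.
struct BtreeSketch {
    int fd;                // file descriptor for the table's .DB file
    size_t block_size;     // bytes per B-tree block
    int level;             // branch levels above the leaves
    uint32_t root_block;   // block number of the root

    // Hypothetical helper: read a branch block (the job block_to_cursor()
    // does in the real code) and return the child block for this key.
    uint32_t read_branch_and_find_child(uint32_t block, const std::string& key);

    // Descend the tree reading each branch block for real, so each key
    // search happens in the right block, then advise the kernel about
    // the lowest-level child instead of reading it.
    void preread_key(const std::string& key) {
        if (level == 0) return;          // root is a leaf: nothing to preread
        uint32_t n = root_block;
        for (int j = level; j > 1; --j)
            n = read_branch_and_find_child(n, key);
        (void)posix_fadvise(fd, static_cast<off_t>(n) * block_size,
                            block_size, POSIX_FADV_WILLNEED);
    }
};
```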
comment:4 by , 10 years ago
Component: | Other → Matcher |
---|
Discussion on IRC reveals that wgreenberg's postlists only have one level (plus the root), so in his situation the patch does do what's intended (as the root block should be in C[level]). For the general case, it either needs to not descend the tree, to actually read the branch blocks touched, or to make multiple passes to read a level and pre-read the one below.
Gmane's postlist.DB is 3 levels (plus the root, I think) for 120 million documents, so the trees don't get very deep. A pre-read of just the wanted blocks in the level below the root is likely to help even for those with deeper trees.
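To see why the trees stay shallow: assuming a fan-out on the order of 500 child pointers per branch block (an illustrative figure, not a measurement), 500^3 = 1.25×10^8, so three branch levels already cover more than 120 million entries.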
comment:5 by , 10 years ago
And confirming this works as intended in this particular case:
strace reports that the readahead offset/size values match the preads exactly (for single-term queries)
comment:6 by , 10 years ago
Thanks for the feedback. I amended the patch to only readahead the top level of the B-tree, and removed some of the unnecessary cruft I copied from ::find.
I removed the stateful changes to the current cursor since, if I understand this right, there's no reason the readahead method should alter cursor state at all. strace again verifies that this patch reads the correct blocks (even for multi-term queries).
I'll try to get another patch in which does a similar readahead optimization for record.DB lookups (using fetch, as per IRC discussion with olly).
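For reference, the caller-side pattern being discussed is roughly this; a minimal sketch assuming an already-open database, with the query and result count invented for illustration:

```cpp
#include <xapian.h>
#include <string>

void show_results(Xapian::Database& db)
{
    Xapian::Enquire enquire(db);
    enquire.set_query(Xapian::Query("example"));   // illustrative query
    Xapian::MSet mset = enquire.get_mset(0, 10);
    mset.fetch();   // hint the backend to start fetching document data now
    for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
        // By the time we get here the data should already be on its way.
        std::string data = i.get_document().get_data();
    }
}
```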
by , 10 years ago
Attachment: | single_level_readahead.patch added |
---|
Amended patch that only reads the first level of a postlist b-tree
by , 10 years ago
Attachment: | postlist_record_readahead.patch added |
---|
Implemented readahead for both multi-db queries and MSet::fetch optimization
comment:7 by , 10 years ago
Added a patch which implements ChertDatabase::request_document with the readahead op. This results in a nice speedup when document data is smaller than 1 B+tree block size (i.e. only 1 read is required)
comment:8 by , 10 years ago
Added glass support and a check that posix_fadvise() exists before calling it
(sorry if repeatedly amending/uploading patches isn't the way to go here! I could link to my fork of xapian's issue branch if that's more convenient)
by , 10 years ago
Attachment: | readahead_perf_fix_with_glass.patch added |
---|
comment:9 by , 10 years ago
Patches here or a link to a git tree somewhere are both OK with me.
Not sure about throwing UnimplementedError for backends which don't support this - doesn't that make it impossible to search using them?
Given this is an optimisation, we can probably silently ignore that.
Some of the indentation looks wrong (make sure your tab width is set to 8).
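On the UnimplementedError point: since prereading is only a hint, a no-op default is enough; a sketch with a hypothetical method name, not the actual patch:

```cpp
// Hypothetical base-class default: backends without preread support
// simply do nothing, so searches still work and callers never see an
// exception from what is purely an optimisation.
virtual void preread_postlist(const std::string& /*term*/) const { }
```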
comment:10 by , 10 years ago
Thanks for the review, olly. Here's a revised patch with fixed whitespace/no UnimplementedError.
comment:11 by , 10 years ago
I've been looking at this today.
I've fixed a few issues with the patch:
- We shouldn't try to preread anything for a table with 0 levels, as then the root block is a leaf block, and trying to read it as a branch block at best gives nonsense block numbers to preread, which fail when assertions are enabled.
- The assertion checking that the block number is valid for GlassTable didn't compile (it appears to have just been copied from the chert case).
- I've added a cache of the previous block number preread for each table, and use it to avoid repeated requests for the same block (see the sketch after this list).
- I've made it request the query terms sorted in byte order, which works nicely with the previous change to reduce the number of posix_fadvise() calls.
- The termname was being used as the key to the postlist table for prereading, which isn't quite right, though it's close enough for your benchmarking results to be valid.
I'm a bit surprised that the read-ahead for the record/docdata table makes a difference, as it seems to just preread each docid right before reading it, so it seems like it wouldn't really help. But you reported above that this helps IIUC.
I'll attach an updated patch shortly.
comment:12 by , 10 years ago
wgreenberg: Sorry if I've asked this before, but are you OK with the patch licensing requirements?
http://trac.xapian.org/browser/trunk/xapian-core/HACKING#L1364
comment:13 by , 10 years ago
olly: Thanks for the patch fixup! And the licensing requirements aren't a problem at all.
comment:14 by , 9 years ago
FWIW, it's entirely possible that my observed speedup for the record/docdata readaheads was due to experimental error. I didn't test/analyze that case nearly as thoroughly as query performance.
I agree with your suggestion made elsewhere that any docdata preread should occur w/ MSet::fetch(), and would be happy to try testing based on that.
comment:15 by , 9 years ago
Milestone: | → 1.3.4 |
---|---|
Status: | new → assigned |
Actually, the issue with prereading record data is more subtle. The current code uses the same mechanism to read in advance as to read live, which is fine for remote backend prereading, but for a local database I can't see it being useful to do the preread and then immediately do the actual read.
I've committed the patch with a couple more tweaks in [b904d0e5ae9a1fca9ca0bd1bbe5fe670d3ce1431], with a FIXME comment to sort out the record case. It should work as it is; there's just a pointless preread call when reading a document from the MSet.
That shouldn't be too hard to resolve, and it'd be good to sort out before 1.3.4, so setting the milestone to that.
comment:16 by , 9 years ago
Incidentally, looking at attachment:ordered_strace.log I notice there are two reads from each postlist table as the shards are opened. I think the second read must be getting the database stats, which back then were stored as an entry in the postlist table. These are now stored in the "version file", so that probably means we now avoid that extra read (and if the tables have more levels, there would have been multiple reads for each postlist table to get that information).
comment:18 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I've fixed it so prereading for the record/docdata table only happens if the user calls MSet::fetch(), in [52c5b2191308f99695c2a7fd54f79c78c85920d4].
With that, our work here is done, so closing.