Opened 9 years ago

Closed 8 years ago

#671 closed defect (fixed)

Performance issues when querying over large number of local databases (shards)

Reported by: Will Greenberg Owned by: Olly Betts
Priority: normal Milestone: 1.3.5
Component: Matcher Version:
Severity: normal Keywords: sharding performance
Cc: Blocked By:
Blocking: Operating System: Linux

Description (last modified by Olly Betts)

Xapian seems to struggle when performing queries over a large(ish) number of local databases. For my tests, I used a database of ~800,000 documents with ~920,000 terms total sharded among 100 separate xapian databases. These were then "joined" in a single stub database file, and queries were performed by opening that stub database.

The following metrics were gathered on linux 3.13, xapian 1.3.2, on a 7200rpm spinning disk HDD.

I found that calling Enquire::get_mset for single-term queries on this sharded database would take 1300-1600ms to complete, sometimes more. A query on a non-sharded database containing the same data (same # of docs and terms) usually takes <300ms to get the same mset. Running strace on the process doing the queries showed that nearly all of this time was spent waiting for pread64(2) to read a single block of each of the 100 shard's postlist files[1]. This read occurs when MultiMatch is accumulating stats on all of its submatches, and requests termfreqs on each postlist.

Subsequent queries of the same single term would result in times closer to 100ms, implying that these pread64 delays are mitigated when the VFS cache has entries for the postlist files. Working based on this assumption, I tried out a pretty gross patch[2] which calls readahead(2) on the desired postlist blocks well before they're read. The results were significant: Enquire::get_mset for queries on the sharded database with the patched xapian took ~600ms. Presumably, other bad-but-predictable IO paths can be mitigated in the same way (e.g. record.DB reads from Document::get_data).

There are probably other ways of addressing this issue without using readahead(2) or its more portable cousin, posix_fadvise(2). I personally think that the memory hit from priming the cache is acceptable, but I'd be interested to know what others think.

[1] Here's a typical example strace of a sharded query that I ran through xapian's strace-analyse script: https://gist.github.com/wgreenberg/1f8264ff79815fefe5a5

[2] Super hacky patch to test the readahead theory: https://github.com/wgreenberg/xapian/commit/d8af1fde685626283826ecaf52d4ac49cebdea98

Attachments (8)

ordered_strace.log (35.3 KB ) - added by Olly Betts 9 years ago.
strace log from gist
d8af1fde685626283826ecaf52d4ac49cebdea98.patch (2.5 KB ) - added by Olly Betts 9 years ago.
super hacky patch from github
readahead_perf_fix.patch (6.6 KB ) - added by Will Greenberg 9 years ago.
First draft of posix_fadvise readahead fix
single_level_readahead.patch (6.5 KB ) - added by Will Greenberg 9 years ago.
Amended patch that only reads the first level of a postlist b-tree
postlist_record_readahead.patch (7.7 KB ) - added by Will Greenberg 9 years ago.
Implemented readahead for both multi-db queries and MSet::fetch optimization
readahead_perf_fix_with_glass.patch (12.0 KB ) - added by Will Greenberg 9 years ago.
readahead_perf_fix_with_glass2.patch (11.8 KB ) - added by Will Greenberg 9 years ago.
Revised patch
readahead-improved.patch (13.6 KB ) - added by Olly Betts 9 years ago.
improved patch


Change History (26)

by Olly Betts, 9 years ago

Attachment: ordered_strace.log added

strace log from gist

comment:1 by Olly Betts, 9 years ago

Description: modified (diff)

by Olly Betts, 9 years ago

super hacky patch from github

comment:2 by Will Greenberg, 9 years ago

Added readahead_perf_fix.patch, which performs posix_fadvise readahead syscalls for each postlist, for each term in a multi-database query. This provides a performance boost equivalent to the hacky patch in d8af1fde685626283826ecaf52d4ac49cebdea98.patch.

Not sure what the usual review process is here, but I'm sure there's a lot that needs to be fixed up in this patch before it'll be considered.

Last edited 9 years ago by Will Greenberg (previous) (diff)

by Will Greenberg, 9 years ago

Attachment: readahead_perf_fix.patch added

First draft of posix_fadvise readahead fix

comment:3 by Olly Betts, 9 years ago

I don't think this patch is doing what you think it is.

Each table has a built-in cursor (C), which you're using here. It's used for operations which are implemented using a cursor, but for which the cursor doesn't need to live on after we return to the caller - this mostly just avoids having to create a temporary cursor for every such operation, but also has the benefit that the blocks needed may already have been loaded by a previous operation.

The problem with what you're doing is that you just use whatever is in C already. For the root block, that's fine, but once j < level we're searching for a key in whatever block of that level happens to be in the cursor. Most of the time that won't be the right block, so we'll end up on the first or last entry of that branch block, depending which side of the right path down the tree we are. So (unless something else happens to be making sure that C points to the right place) you're pre-reading an essentially arbitrary set of blocks here for the most part.

I guess it gives a performance boost because we will want some of the blocks in that arbitrary set, and so pre-reading something is better than pre-reading nothing - we get reads for free while other stuff is going on.

But I think this should call block_to_cursor() as it descends the tree, and then do the pre-read at the lowest level instead (which might not necessarily be the leaf level, but there's not much point iterating further once we stop reading blocks).

Last edited 9 years ago by Olly Betts (previous) (diff)

comment:4 by Olly Betts, 9 years ago

Component: Other → Matcher

Discussion on IRC reveals that wgreenberg's postlists only have one level (plus root), so the patch does do what's intended there (as the root block should be in C[level]). For the general case, it either needs to not descend the tree, to actually read the branch blocks touched, or to make multiple passes, reading one level and pre-reading the one below.

Gmane's postlist.DB is 3 levels (plus the root I think) for 120 million documents, so the trees don't get very deep. A pre-read of just the wanted blocks in the level below the root is likely to help even for those with deeper trees.

comment:5 by Olly Betts, 9 years ago

And confirming this works as intended in this particular case:

strace reports that the readahead offset/size values match the preads exactly (for single-term queries)

Last edited 9 years ago by Olly Betts (previous) (diff)

comment:6 by Will Greenberg, 9 years ago

Thanks for the feedback. I amended the patch to only readahead the top level of the B-tree, and removed some of the unnecessary cruft I copied from ::find.

I removed the stateful changes to the current cursor since, if I understand this right, there's no reason the readahead method should alter cursor state at all. strace again verifies that this patch reads the correct blocks (even for multi-term queries).

I'll try to get another patch in which does a similar readahead optimization for record.DB lookups (using fetch, as per IRC discussion with olly).

by Will Greenberg, 9 years ago

Amended patch that only reads the first level of a postlist b-tree

by Will Greenberg, 9 years ago

Implemented readahead for both multi-db queries and MSet::fetch optimization

comment:7 by Will Greenberg, 9 years ago

Added a patch which implements ChertDatabase::request_document with the readahead op. This results in a nice speedup when document data is smaller than 1 B+tree block size (i.e. only 1 read is required)

comment:8 by Will Greenberg, 9 years ago

Added glass support, and a check that posix_fadvise exists before calling it

(sorry if repeatedly amending/uploading patches isn't the way to go here! I could link to my fork of xapian's issue branch if that's more convenient)

by Will Greenberg, 9 years ago

comment:9 by Olly Betts, 9 years ago

Patches here or a link to a git tree somewhere are both OK with me.

Not sure about throwing UnimplementedError for backends which don't support this - doesn't that make it impossible to search using them?

Given this is an optimisation, we can probably silently ignore that.

Some of the indentation looks wrong (make sure your tab width is set to 8).

comment:10 by Will Greenberg, 9 years ago

Thanks for the review, olly. Here's a revised patch with fixed whitespace/no UnimplementedError.

by Will Greenberg, 9 years ago

Revised patch

comment:11 by Olly Betts, 9 years ago

I've been looking at this today.

I've fixed a few issues with the patch:

  • We shouldn't try to preread anything for a table with 0 levels, as then the root block is a leaf block, and trying to read it as a branch block at best gives nonsense block numbers to preread, which fail assertions when they're enabled.
  • The assertion to check the block number is valid for GlassTable didn't compile (it appears it was just copied from the chert case).
  • I've added a cache of the previous block number preread for each table, and use it to avoid repeated requests for the same block.
  • I've made it request the query terms sorted in byte order, which works nicely with the previous change to reduce the number of posix_fadvise() calls.
  • The termname was being used as the key to the postlist table for prereading, which isn't quite right, though it's close enough for your benchmarking results to be valid.

I'm a bit surprised that the read-ahead for the record/docdata table makes a difference, as it seems to just preread each docid right before reading it, so it seems like it wouldn't really help. But you reported above that this helps IIUC.

I'll attach an updated patch shortly.

by Olly Betts, 9 years ago

Attachment: readahead-improved.patch added

improved patch

comment:12 by Olly Betts, 9 years ago

wgreenberg: Sorry if I've asked this before, but are you OK with the patch licensing requirements?

http://trac.xapian.org/browser/trunk/xapian-core/HACKING#L1364

comment:13 by Will Greenberg, 9 years ago

olly: Thanks for the patch fixup! And the licensing requirements aren't a problem at all.

comment:14 by Will Greenberg, 9 years ago

FWIW, it's entirely possible that my observed speedup for the record/docdata readaheads was due to experimental error. I didn't test/analyze that case nearly as thoroughly as query performance.

I agree with your suggestion made elsewhere that any docdata preread should occur w/ MSet::fetch(), and would be happy to try testing based on that.

comment:15 by Olly Betts, 9 years ago

Milestone: 1.3.4
Status: new → assigned

Actually, the issue with prereading record data is more subtle. The current code uses the same mechanism to read in advance as to read live, which is fine for remote backend prereading, but for a local database I can't see it being useful to do the preread and then immediately do the actual read.

I've committed the patch with a couple more tweaks in [b904d0e5ae9a1fca9ca0bd1bbe5fe670d3ce1431], with a FIXME comment to sort out the record case. It should work as it is; there's just a pointless preread call when reading a document from the MSet.

That shouldn't be too hard to resolve, and it'd be good to sort out before 1.3.4 so setting the milestone to that.

comment:16 by Olly Betts, 8 years ago

Incidentally, looking at attachment:ordered_strace.log I notice there are two reads from each postlist table as the shards are opened. I think the second read must be getting the database stats, which back then were stored as an entry in the postlist table. These are now stored in the "version file", so that probably means we now avoid that extra read (and if the tables have more levels, there would have been multiple reads for each postlist table to get that information).

comment:17 by Olly Betts, 8 years ago

Milestone: 1.3.4 → 1.3.5

Ticket retargeted after milestone closed

comment:18 by Olly Betts, 8 years ago

Resolution: fixed
Status: assigned → closed

I've fixed it so prereading for the record/docdata table only happens if the user calls MSet::fetch() in [52c5b2191308f99695c2a7fd54f79c78c85920d4].

With that, our work here is done, so closing.
