#153 closed defect (released)
remote database iterator performance is slow
Reported by: | Mark Hammond | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Backend-Remote | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Olly Betts, Richard Boulton, Sidnei da Silva | Blocked By: | |
Blocking: | #120, #154 | Operating System: | All |
Description
I've some (python) test code that creates a database with 2500 documents, ~170000 terms, and then performs a 'partial' search against it. It takes around 15 seconds to complete, with (on Windows) xapian-tcpsrv.exe and the 'system' process chewing all CPU.
It turns out this is the query parser fetching an 'allterms' iterator. Each term is being delivered in its own message, and this is the process taking all the time (ie, the actual query is fast)
It seems that we probably want to optimize the transfer of some iterators - I believe something similar has recently been done by Richard, but I don't recall the details.
But even when the transfer is optimized, we'd probably still like to avoid sending the entire term list each time a query is parsed. Ideally, we would be able to cache the enumerators while the database remains open - IIUC, a DB (version) is guaranteed not to change while open, so such a cache should be safe. Either way, I'll wait for feedback before speculating further...
Attachments (4)
Change History (15)
comment:1 by , 18 years ago
comment:2 by , 18 years ago
Owner: | changed from | to
---|
The alternative solution would be to move the query parsing to the network server, so that the postlist doesn't need to be serialised across the network. This might be the more efficient solution, and an easier fix.
comment:3 by , 18 years ago
Cc: | added |
---|---|
rep_platform: | PC → All |
You can't simply move the query parsing across to the remote server - that's not going to work for the case where you're searching multiple databases as you're merging several alltermsiterators there.
Wildcarding and partial probably shouldn't be expanded in the queryparser anyway, so you might be able to handle them remotely if there's only one remote server. But finding a way to address this that helps the general case would be better.
Remote termlists could wait for the first operation and if it's skip_to start at that point. That would help this case a lot, and is probably a fairly simple change.
comment:4 by , 18 years ago
Blocking: | 120 added |
---|
Richard has a patch, and it sounds like he thinks it's suitable for 1.0.X:
<richardb> A networked wildcard query parse now takes about the same time as a local wildcard query parse. <richardb> As long as there aren't an insanely large number of matching terms. <richardb> Just checking that all the tests pass, and then I'll attach a patch to the bug report. <richardb> It won't make it for 1.0.0, but we can add it fairly shortly after.
comment:5 by , 18 years ago
On thing missing from the patch I've just attached is support for the allterm iterator methods with a prefix in the Ruby bindings - I took a quick look at it, but I don't speak Ruby enough to know how to fix it.
by , 18 years ago
Attachment: | xapian-allprefixedterms-wildcard.patch added |
---|
Updated patch to fix this issue
comment:6 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:7 by , 18 years ago
Blocking: | 154 added |
---|---|
Status: | new → assigned |
Proposing that the patch for this bug goes into 1.0.1:
Pro: it's a fix for a genuine user performance problem.
Pro: it has a reasonably good test case (though could always be made better).
Pro: it seems a clean design to me.
Pro: it satisfies the conditions that it doesn't break API or ABI (though it does add to them).
Con: it touches quite a few files, though the changes are relatively minor in each one.
Con: it changes the API - are we sure we want to go with the API as implemented in this bug? Perhaps we should allow iteration between two arbitrary end points. But iteration between two arbitrary end points isn't precluded by this solution - we could overload the allterms_begin() method further, to support a two-argument form.
Con: it changes the network protocol.
So - comments?
comment:8 by , 18 years ago
Overall, I think I'd like to see how 1.0.0 measures up before finally deciding on this (or any other patch with possible "cons"). If the problem reports start rolling in, I think 1.0.1 should be more conservative. If 1.0.0 is great, we can reasonably be more adventurous sooner.
Pro: it's a fix for a genuine user performance problem.
Although it's in a brand new feature, and the current user (Enfold) need a patched tree currently anyway for other reasons.
Con: it touches quite a few files, though the changes are relatively minor in each one.
I think it's fairly safe from a quick read of the patch. I've not looked at it in the full context though.
Con: it changes the API - are we sure we want to go with the API as implemented in this bug?
It seems the obvious API, though I'd like to think about it a bit more.
Con: it changes the network protocol.
I'd prefer to coordinate such changes where possible, as I noted in bug#154. Having to upgrade clients and servers simultaneously requires extra work, so bumping the protocol version in every release is more annoying to users.
In this case, an old client can actually safely use a new server (but not vice versa) since it will always send an empty prefix, but that's hard to implement since we'd need to patch 1.0.0 to know that protocol version 30 (or whatever) is also OK. This holds for a lot of changes, since we often add new REPLY_* values which are only sent in response to new MSG_* values. Perhaps the next protocol bump should also add a secondary version which we can bump for client compatible changes. Then clients can check server_primary == MY_PRIMARY and server_secondary >= MY_SECONDARY.
by , 18 years ago
Attachment: | remote-protocol-minor-version.patch added |
---|
Patch to implement minor protocol versions
comment:11 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Marking as fixed - the patches are applied to HEAD, and the test case now reports that the wildcard query parsing takes 0:00.02elapsed time, with HEAD.
The query parser uses an allterms iterator to implement wildcards and partial queries. In both of these cases, it usually only wants a subset of the data, so I believe the correct solution is both to optimise the handling of chunks of the data, and to implement the skip_to method().