Opened 18 years ago

Closed 18 years ago

Last modified 18 years ago

#153 closed defect (released)

remote database iterator performance is slow

Reported by: Mark Hammond Owned by: Richard Boulton
Priority: normal Milestone:
Component: Backend-Remote Version: SVN trunk
Severity: normal Keywords:
Cc: Olly Betts, Richard Boulton, Sidnei da Silva Blocked By:
Blocking: #120, #154 Operating System: All

Description

I've some (python) test code that creates a database with 2500 documents, ~170000 terms, and then performs a 'partial' search against it. It takes around 15 seconds to complete, with (on Windows) xapian-tcpsrv.exe and the 'system' process chewing all CPU.

It turns out this is the query parser fetching an 'allterms' iterator. Each term is being delivered in its own message, and this is the process taking all the time (ie, the actual query is fast)

It seems that we probably want to optimize the transfer of some iterators - I believe something similar has recently been done by Richard, but I don't recall the details.

But even when the transfer is optimized, we'd probably still like to avoid sending the entire term list each time a query is parsed. Ideally, we would be able to cache the enumerators while the database remains open - IIUC, a DB (version) is guaranteed not to change while open, so such a cache should be safe. Either way, I'll wait for feedback before speculating further...

Attachments (4)

remdbtest.tgz (705 bytes ) - added by Richard Boulton 18 years ago.
Tarball containing simple test case
xapian-allprefixedterms-wildcard.2.patch (24.3 KB ) - added by Richard Boulton 18 years ago.
Updated patch to fix this issue
xapian-allprefixedterms-wildcard.patch (24.3 KB ) - added by Richard Boulton 18 years ago.
Fix, by adding a prefix parameter to allterms
remote-protocol-minor-version.patch (3.9 KB ) - added by Olly Betts 18 years ago.
Patch to implement minor protocol versions

Download all attachments as: .zip

Change History (15)

comment:1 by Richard Boulton, 18 years ago

The query parser uses an allterms iterator to implement wildcards and partial queries. In both of these cases, it usually only wants a subset of the data, so I believe the correct solution is both to optimise the handling of chunks of the data, and to implement the skip_to method().

comment:2 by Richard Boulton, 18 years ago

Owner: changed from Olly Betts to Richard Boulton

The alternative solution would be to move the query parsing to the network server, so that the postlist doesn't need to be serialised across the network. This might be the more efficient solution, and an easier fix.

comment:3 by Olly Betts, 18 years ago

Cc: olly@… added
rep_platform: PCAll

You can't simply move the query parsing across to the remote server - that's not going to work for the case where you're searching multiple databases as you're merging several alltermsiterators there.

Wildcarding and partial probably shouldn't be expanded in the queryparser anyway, so you might be able to handle them remotely if there's only one remote server. But finding a way to address this that helps the general case would be better.

Remote termlists could wait for the first operation and if it's skip_to start at that point. That would help this case a lot, and is probably a fairly simple change.

by Richard Boulton, 18 years ago

Attachment: remdbtest.tgz added

Tarball containing simple test case

comment:4 by Olly Betts, 18 years ago

Blocking: 120 added

Richard has a patch, and it sounds like he thinks it's suitable for 1.0.X:

<richardb> A networked wildcard query parse now takes about the same time as a local wildcard query parse. <richardb> As long as there aren't an insanely large number of matching terms. <richardb> Just checking that all the tests pass, and then I'll attach a patch to the bug report. <richardb> It won't make it for 1.0.0, but we can add it fairly shortly after.

comment:5 by Richard Boulton, 18 years ago

On thing missing from the patch I've just attached is support for the allterm iterator methods with a prefix in the Ruby bindings - I took a quick look at it, but I don't speak Ruby enough to know how to fix it.

by Richard Boulton, 18 years ago

Updated patch to fix this issue

comment:6 by Richard Boulton, 18 years ago

attachments.isobsolete: 01

comment:7 by Richard Boulton, 18 years ago

Blocking: 154 added
Status: newassigned

Proposing that the patch for this bug goes into 1.0.1:

Pro: it's a fix for a genuine user performance problem.

Pro: it has a reasonably good test case (though could always be made better).

Pro: it seems a clean design to me.

Pro: it satisfies the conditions that it doesn't break API or ABI (though it does add to them).

Con: it touches quite a few files, though the changes are relatively minor in each one.

Con: it changes the API - are we sure we want to go with the API as implemented in this bug? Perhaps we should allow iteration between two arbitrary end points. But iteration between two arbitrary end points isn't precluded by this solution - we could overload the allterms_begin() method further, to support a two-argument form.

Con: it changes the network protocol.

So - comments?

comment:8 by Olly Betts, 18 years ago

Overall, I think I'd like to see how 1.0.0 measures up before finally deciding on this (or any other patch with possible "cons"). If the problem reports start rolling in, I think 1.0.1 should be more conservative. If 1.0.0 is great, we can reasonably be more adventurous sooner.

Pro: it's a fix for a genuine user performance problem.

Although it's in a brand new feature, and the current user (Enfold) need a patched tree currently anyway for other reasons.

Con: it touches quite a few files, though the changes are relatively minor in each one.

I think it's fairly safe from a quick read of the patch. I've not looked at it in the full context though.

Con: it changes the API - are we sure we want to go with the API as implemented in this bug?

It seems the obvious API, though I'd like to think about it a bit more.

Con: it changes the network protocol.

I'd prefer to coordinate such changes where possible, as I noted in bug#154. Having to upgrade clients and servers simultaneously requires extra work, so bumping the protocol version in every release is more annoying to users.

In this case, an old client can actually safely use a new server (but not vice versa) since it will always send an empty prefix, but that's hard to implement since we'd need to patch 1.0.0 to know that protocol version 30 (or whatever) is also OK. This holds for a lot of changes, since we often add new REPLY_* values which are only sent in response to new MSG_* values. Perhaps the next protocol bump should also add a secondary version which we can bump for client compatible changes. Then clients can check server_primary == MY_PRIMARY and server_secondary >= MY_SECONDARY.

by Olly Betts, 18 years ago

Patch to implement minor protocol versions

comment:9 by Richard Boulton, 18 years ago

attachments.isobsolete: 01

(From update of attachment 87) Applied

comment:10 by Richard Boulton, 18 years ago

attachments.isobsolete: 01

(From update of attachment 88) Applied

comment:11 by Richard Boulton, 18 years ago

Resolution: fixed
Status: assignedclosed

Marking as fixed - the patches are applied to HEAD, and the test case now reports that the wildcard query parsing takes 0:00.02elapsed time, with HEAD.

comment:12 by Olly Betts, 18 years ago

Operating System: All
Resolution: fixedreleased

Fixed in 1.0.1.

Note: See TracTickets for help on using tickets.