Opened 10 years ago

Last modified 8 weeks ago

#394 assigned enhancement

Speed up phrase queries with a "settling pond"

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: Matcher Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

The attached patch implements a "settling pond" to delay the checking of exact phrases which are "and-like with the root" (by which I mean if the phrase doesn't match, the whole query doesn't match). We discard pond entries which are below the current min_weight, and when the pond fills up, or the postlist tree is done, we take the highest weighted entries from the pond, which makes it more likely we'll increase min_weight and so be able to discard lower weighted pond entries without needing to do the potentially expensive phrase check.

This patch can dramatically improve query speeds for exact phrase queries for common terms when the positional data isn't cached.

This could be extended to any OP_PHRASE or OP_NEAR check which is "and-like with the root", and also to perform more than one such check.

The patch needs cleaning up in a few places, and the pond size should default to something sane (based on DB size perhaps?) but I'm putting it here to make sure it doesn't get lost or forgotten.

Patch is against trunk r13285.

Attachments (6)

phrase-settling-pond.patch (23.3 KB ) - added by Olly Betts 10 years ago.
Prototype patch
phrase-settling-pond-update.patch (23.5 KB ) - added by Olly Betts 8 years ago.
Patch updated to SVN trunk r16092
phrase-settling-pond-update-20120911.patch (24.6 KB ) - added by Olly Betts 7 years ago.
updated for trunk as of 2012-09-11
phrase-settling-pond-update-20120913.patch (46.3 KB ) - added by Olly Betts 7 years ago.
internally now works with PostList* objects rather than turning then back into strings
positional-query-weight-check-first.patch (2.7 KB ) - added by Olly Betts 7 years ago.
simpler optimisation which works in more cases
positional-query-weight-check-first-1.2.patch (2.4 KB ) - added by Olly Betts 7 years ago.
same patch backported to 1.2

Download all attachments as: .zip

Change History (22)

by Olly Betts, 10 years ago

Attachment: phrase-settling-pond.patch added

Prototype patch

comment:1 by Richard Boulton, 10 years ago

Henry reports on xapian-discuss that:

You may want to seriously consider pushing this patch into trunk since the performance benefit is, well, significant.

My phrase test: from 15s+ down to 1s

comment:2 by Olly Betts, 10 years ago

Priority: normalhigh

This is a big win for a slow case which several users have hit, so marking as high priority.

comment:3 by Olly Betts, 10 years ago

Milestone: 1.2.x

Oh, and it isn't an API or ABI change, so -> milestone:1.2.x

comment:4 by Olly Betts, 8 years ago

Milestone: 1.2.x1.3.0
Status: newassigned

by Olly Betts, 8 years ago

Patch updated to SVN trunk r16092

comment:5 by Olly Betts, 8 years ago

Patch updated to current SVN trunk. Apart from fixing a comment typo, the changes are just to update it for changes in trunk since the previous patch.

Testsuite is looking good (still running, but it has passed everything so far) but we don't have tests of every matcher feature in combination with a phrase search which is really needed to give us good coverage for this change.

I'd like to finish off and land the query internal reimplementation first, then I'll update this for the changes that brings in, and try to get it merged.

comment:6 by Olly Betts, 8 years ago

It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches. And it'll mean we don't have to complicate the matcher main loop further, which can only be a good thing.

comment:7 by Olly Betts, 8 years ago

Milestone: 1.3.01.3.x

by Olly Betts, 7 years ago

updated for trunk as of 2012-09-11

by Olly Betts, 7 years ago

internally now works with PostList* objects rather than turning then back into strings

comment:8 by Olly Betts, 7 years ago

The latest patch addresses one of the quick-and-dirty aspects of the initial patch.

It passes the testsuite, but I suspect many need attention for the multi-database case (if so test coverage could be improved).

by Olly Betts, 7 years ago

simpler optimisation which works in more cases

by Olly Betts, 7 years ago

same patch backported to 1.2

comment:9 by Olly Betts, 7 years ago

I've had some feedback that the simpler patch is effective, so I'll apply that. That "feedbacker" is going to test both patches separately and together, so I'll report the conclusions if/when I hear back.

comment:10 by Olly Betts, 7 years ago

OK, positional-query-weight-check-first.patch applied to trunk (as hence will be in 1.3.1), with an additional tweak not to calculate or check the weight when w_min == 0.0 (so we don't pointlessly calculate a lot of extra weights in cases such as when a large MSet size is requested).

comment:11 by Olly Betts, 7 years ago

Backported that tweaked version of positional-query-weight-check-first.patch for 1.2.14 in r17079.

comment:12 by Olly Betts, 7 years ago

Gustavo Yoshizaki from Avature kindly provided the following benchmark results, and agreed to let me make them public:

These are the four configurations tested:
A = No optimization
B = Pond size optimization
C = Phrase weight check optimization
D = Both optimizations

Average time in seconds to make more than 4000 searches (the same 
searches with the same dataset)
A:    690.95
B:    318.13
C:    137.99
D:    124.88

Number of times in which the configuration was better than the other 
three configuration
A: 464
B: 728
C: 1489
D: 2183

For the searches where configurations B, C and D is faster than 
configuration A, the % of speed up
B: 62,94 %
C: 59,58 %
D: 57,87 %

So C is the patch which is in 1.2.14 and will be in 1.3.1. B is the original pond size patch, I believe with the default pond size.

The tl;dr version is that the patch now applied makes significantly more difference alone than the "pond" patch alone does (at least for these searches), but using both together would improve things further still. With the default pond size, by 10% on average, but I bet that's not the optimal size as it was just picked arbitrarily.

comment:13 by Olly Betts, 5 years ago

Milestone: 1.3.x1.4.x

Shouldn't block 1.4.0.

There have been a number of phrase-related enhancements, so this really needs rechecking to see what effect the pond now has, and to see what pond size works best.

comment:14 by Olly Betts, 4 years ago

Over the last 6 months, the glass backend has seen several changes which reduce the overhead per item in the table, which particularly helps the position table as it has a huge number of entries, many of which are really small.

Since glass will be the default backend in 1.4.0, this is another reason we need fresh benchmarks to decide if the settling pond is still worth pursuing.

in reply to:  6 comment:15 by Olly Betts, 8 weeks ago

Replying to olly:

It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches.

This is no longer the case for git master - remote matches are now handled by merging MSet objects rather than reinjecting their contents back into a local matcher loop.

And it'll mean we don't have to complicate the matcher main loop further, which can only be a good thing.

This is certainly still a valid concern.

comment:16 by Olly Betts, 8 weeks ago

Milestone: 1.4.x1.5.0
Priority: highnormal
Version: 1.1.2git master

1.4.13 can combine the and-like parts of AND_MAYBE and AND_NOT with other and-like queries which means phrase checks can get hoisted higher in some cases. And the "maybe" part of AND_MAYBE gets hoisted above the phrase check so we can avoid visiting it if the phrase doesn't match.

We really should re-evaluate the settling pond idea in the light of the large number of changes in this area and either implement some version of it or close this ticket. Marking for 1.5.0 for now.

Note: See TracTickets for help on using tickets.