Opened 15 years ago

Last modified 4 months ago

#394 assigned enhancement

Speed up phrase queries with a "settling pond"

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Matcher Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

The attached patch implements a "settling pond" to delay the checking of exact phrases which are "and-like with the root" (by which I mean if the phrase doesn't match, the whole query doesn't match). We discard pond entries which are below the current min_weight, and when the pond fills up, or the postlist tree is done, we take the highest weighted entries from the pond, which makes it more likely we'll increase min_weight and so be able to discard lower weighted pond entries without needing to do the potentially expensive phrase check.

This patch can dramatically improve query speeds for exact phrase queries for common terms when the positional data isn't cached.

This could be extended to any OP_PHRASE or OP_NEAR check which is "and-like with the root", and also to perform more than one such check.

The patch needs cleaning up in a few places, and the pond size should default to something sane (based on DB size perhaps?) but I'm putting it here to make sure it doesn't get lost or forgotten.

Patch is against trunk r13285.

Attachments (7)

phrase-settling-pond.patch (23.3 KB ) - added by Olly Betts 15 years ago.
Prototype patch
phrase-settling-pond-update.patch (23.5 KB ) - added by Olly Betts 12 years ago.
Patch updated to SVN trunk r16092
phrase-settling-pond-update-20120911.patch (24.6 KB ) - added by Olly Betts 12 years ago.
updated for trunk as of 2012-09-11
phrase-settling-pond-update-20120913.patch (46.3 KB ) - added by Olly Betts 12 years ago.
internally now works with PostList* objects rather than turning then back into strings
positional-query-weight-check-first.patch (2.7 KB ) - added by Olly Betts 12 years ago.
simpler optimisation which works in more cases
positional-query-weight-check-first-1.2.patch (2.4 KB ) - added by Olly Betts 12 years ago.
same patch backported to 1.2
phrase-settling-pond-update-20130517.patch (30.7 KB ) - added by Olly Betts 23 months ago.
2013 version of pond patch

Download all attachments as: .zip

Change History (27)

by Olly Betts, 15 years ago

Attachment: phrase-settling-pond.patch added

Prototype patch

comment:1 by Richard Boulton, 14 years ago

Henry reports on xapian-discuss that:

You may want to seriously consider pushing this patch into trunk since the performance benefit is, well, significant.

My phrase test: from 15s+ down to 1s

comment:2 by Olly Betts, 14 years ago

Priority: normalhigh

This is a big win for a slow case which several users have hit, so marking as high priority.

comment:3 by Olly Betts, 14 years ago

Milestone: 1.2.x

Oh, and it isn't an API or ABI change, so -> milestone:1.2.x

comment:4 by Olly Betts, 13 years ago

Milestone: 1.2.x1.3.0
Status: newassigned

by Olly Betts, 12 years ago

Patch updated to SVN trunk r16092

comment:5 by Olly Betts, 12 years ago

Patch updated to current SVN trunk. Apart from fixing a comment typo, the changes are just to update it for changes in trunk since the previous patch.

Testsuite is looking good (still running, but it has passed everything so far) but we don't have tests of every matcher feature in combination with a phrase search which is really needed to give us good coverage for this change.

I'd like to finish off and land the query internal reimplementation first, then I'll update this for the changes that brings in, and try to get it merged.

comment:6 by Olly Betts, 12 years ago

It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches. And it'll mean we don't have to complicate the matcher main loop further, which can only be a good thing.

comment:7 by Olly Betts, 12 years ago

Milestone: 1.3.01.3.x

by Olly Betts, 12 years ago

updated for trunk as of 2012-09-11

by Olly Betts, 12 years ago

internally now works with PostList* objects rather than turning then back into strings

comment:8 by Olly Betts, 12 years ago

The latest patch addresses one of the quick-and-dirty aspects of the initial patch.

It passes the testsuite, but I suspect many need attention for the multi-database case (if so test coverage could be improved).

by Olly Betts, 12 years ago

simpler optimisation which works in more cases

by Olly Betts, 12 years ago

same patch backported to 1.2

comment:9 by Olly Betts, 11 years ago

I've had some feedback that the simpler patch is effective, so I'll apply that. That "feedbacker" is going to test both patches separately and together, so I'll report the conclusions if/when I hear back.

comment:10 by Olly Betts, 11 years ago

OK, positional-query-weight-check-first.patch applied to trunk (as hence will be in 1.3.1), with an additional tweak not to calculate or check the weight when w_min == 0.0 (so we don't pointlessly calculate a lot of extra weights in cases such as when a large MSet size is requested).

comment:11 by Olly Betts, 11 years ago

Backported that tweaked version of positional-query-weight-check-first.patch for 1.2.14 in r17079.

comment:12 by Olly Betts, 11 years ago

Gustavo Yoshizaki from Avature kindly provided the following benchmark results, and agreed to let me make them public:

These are the four configurations tested:
A = No optimization
B = Pond size optimization
C = Phrase weight check optimization
D = Both optimizations

Average time in seconds to make more than 4000 searches (the same 
searches with the same dataset)
A:    690.95
B:    318.13
C:    137.99
D:    124.88

Number of times in which the configuration was better than the other 
three configuration
A: 464
B: 728
C: 1489
D: 2183

For the searches where configurations B, C and D is faster than 
configuration A, the % of speed up
B: 62,94 %
C: 59,58 %
D: 57,87 %

So C is the patch which is in 1.2.14 and will be in 1.3.1. B is the original pond size patch, I believe with the default pond size.

The tl;dr version is that the patch now applied makes significantly more difference alone than the "pond" patch alone does (at least for these searches), but using both together would improve things further still. With the default pond size, by 10% on average, but I bet that's not the optimal size as it was just picked arbitrarily.

comment:13 by Olly Betts, 9 years ago

Milestone: 1.3.x1.4.x

Shouldn't block 1.4.0.

There have been a number of phrase-related enhancements, so this really needs rechecking to see what effect the pond now has, and to see what pond size works best.

comment:14 by Olly Betts, 8 years ago

Over the last 6 months, the glass backend has seen several changes which reduce the overhead per item in the table, which particularly helps the position table as it has a huge number of entries, many of which are really small.

Since glass will be the default backend in 1.4.0, this is another reason we need fresh benchmarks to decide if the settling pond is still worth pursuing.

in reply to:  6 ; comment:15 by Olly Betts, 5 years ago

Replying to olly:

It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches.

This is no longer the case for git master - remote matches are now handled by merging MSet objects rather than reinjecting their contents back into a local matcher loop.

And it'll mean we don't have to complicate the matcher main loop further, which can only be a good thing.

This is certainly still a valid concern.

comment:16 by Olly Betts, 5 years ago

Milestone: 1.4.x1.5.0
Priority: highnormal
Version: 1.1.2git master

1.4.13 can combine the and-like parts of AND_MAYBE and AND_NOT with other and-like queries which means phrase checks can get hoisted higher in some cases. And the "maybe" part of AND_MAYBE gets hoisted above the phrase check so we can avoid visiting it if the phrase doesn't match.

We really should re-evaluate the settling pond idea in the light of the large number of changes in this area and either implement some version of it or close this ticket. Marking for 1.5.0 for now.

in reply to:  15 comment:17 by Olly Betts, 4 years ago

Replying to Olly Betts:

Replying to olly:

It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches.

This is no longer the case for git master - remote matches are now handled by merging MSet objects rather than reinjecting their contents back into a local matcher loop.

However, the (new on master) PostListTree class which handles the local shards processes them one by one but their docids are interleaved, so it returns docids in non-ascending order in the general case, so I think we can insert the settling pond either just above or just below that and keep the design nice and clean. I think just below is probably the right place since otherwise we'd need to keep track of which PostList each entry in the pond corresponds to. This would mean the pond had to be drained between processing shards, but that's not unreasonable.

comment:18 by Olly Betts, 23 months ago

I found a somewhat newer version of the patch from 2013 on my dev box so attaching in case it's useful. I think this is probably the version of the patch Gustavo tested.

It would be good to put this ticket to bed, either by concluding that the idea is no longer useful in the light of the numerous improvements to phrase search performance we've made in the meantime, or by merging an updated implementation of the idea.

Looking over the 2013 patch, some things come to mind.

  • It's a shame that it only helps when the positional check is "and-like with the root" - e.g. "some phrase" OR "other words" isn't helped at all. I can see that a subquery could read ahead a bit and have its own pond, but it gets more complicated with things like count_matching_subqs() needing to work for the "current" entry and I'm not sure how feasible it would actually be. Probably best to retry the idea at the top level and if it helps in that case think harder about whether it's doable for a subquery.
  • It could probably still use read_position_list() instead of creating new PositionList objects every time, with a little rejigging so that the docid to open for is specifiable. That would reduce memory allocation overhead and mean that we might not have to deal in the term names.
  • The data structure for the pond could be improved. Comments in the patch suggest a min/max heap. Essentially we need to be able to efficiently add items, find and remove the entry with the highest weight, and remove all entries below a specified weight; the data structure has a fixed max number of entries which is known in advance
Last edited 23 months ago by Olly Betts (previous) (diff)

by Olly Betts, 23 months ago

2013 version of pond patch

comment:19 by Olly Betts, 23 months ago

Hmm, phrase-settling-pond-update-20120913.patch switched pool_terms to be vector<PostList*> but the apparently newer 2013 patch is back to vector<std::string>, so I think the 20120913 version may be the best starting point for further work.

comment:20 by Olly Betts, 4 months ago

Milestone: 1.5.02.0.0

Postponing - this optimisation could be added in a stable release so doesn't need to block starting a new release series.

Note: See TracTickets for help on using tickets.