Opened 15 years ago
Last modified 12 months ago
#394 assigned enhancement
Speed up phrase queries with a "settling pond"
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Matcher | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
The attached patch implements a "settling pond" to delay the checking of exact phrases which are "and-like with the root" (by which I mean if the phrase doesn't match, the whole query doesn't match). We discard pond entries which are below the current min_weight, and when the pond fills up, or the postlist tree is done, we take the highest weighted entries from the pond, which makes it more likely we'll increase min_weight and so be able to discard lower weighted pond entries without needing to do the potentially expensive phrase check.
This patch can dramatically improve query speeds for exact phrase queries for common terms when the positional data isn't cached.
This could be extended to any OP_PHRASE or OP_NEAR check which is "and-like with the root", and also to perform more than one such check.
The patch needs cleaning up in a few places, and the pond size should default to something sane (based on DB size perhaps?) but I'm putting it here to make sure it doesn't get lost or forgotten.
Patch is against trunk r13285.
Attachments (7)
Change History (27)
by , 15 years ago
Attachment: | phrase-settling-pond.patch added |
---|
comment:1 by , 15 years ago
Henry reports on xapian-discuss that:
You may want to seriously consider pushing this patch into trunk since the performance benefit is, well, significant.
My phrase test: from 15s+ down to 1s
comment:2 by , 14 years ago
Priority: | normal → high |
---|
This is a big win for a slow case which several users have hit, so marking as high priority.
comment:3 by , 14 years ago
Milestone: | → 1.2.x |
---|
Oh, and it isn't an API or ABI change, so -> milestone:1.2.x
comment:4 by , 13 years ago
Milestone: | 1.2.x → 1.3.0 |
---|---|
Status: | new → assigned |
by , 13 years ago
Attachment: | phrase-settling-pond-update.patch added |
---|
Patch updated to SVN trunk r16092
comment:5 by , 13 years ago
Patch updated to current SVN trunk. Apart from fixing a comment typo, the changes are just to update it for changes in trunk since the previous patch.
Testsuite is looking good (still running, but it has passed everything so far) but we don't have tests of every matcher feature in combination with a phrase search which is really needed to give us good coverage for this change.
I'd like to finish off and land the query internal reimplementation first, then I'll update this for the changes that brings in, and try to get it merged.
follow-up: 15 comment:6 by , 13 years ago
It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches. And it'll mean we don't have to complicate the matcher main loop further, which can only be a good thing.
comment:7 by , 13 years ago
Milestone: | 1.3.0 → 1.3.x |
---|
by , 12 years ago
Attachment: | phrase-settling-pond-update-20120911.patch added |
---|
updated for trunk as of 2012-09-11
by , 12 years ago
Attachment: | phrase-settling-pond-update-20120913.patch added |
---|
internally now works with PostList* objects rather than turning then back into strings
comment:8 by , 12 years ago
The latest patch addresses one of the quick-and-dirty aspects of the initial patch.
It passes the testsuite, but I suspect many need attention for the multi-database case (if so test coverage could be improved).
by , 12 years ago
Attachment: | positional-query-weight-check-first.patch added |
---|
simpler optimisation which works in more cases
by , 12 years ago
Attachment: | positional-query-weight-check-first-1.2.patch added |
---|
same patch backported to 1.2
comment:9 by , 12 years ago
I've had some feedback that the simpler patch is effective, so I'll apply that. That "feedbacker" is going to test both patches separately and together, so I'll report the conclusions if/when I hear back.
comment:10 by , 12 years ago
OK, positional-query-weight-check-first.patch applied to trunk (as hence will be in 1.3.1), with an additional tweak not to calculate or check the weight when w_min == 0.0 (so we don't pointlessly calculate a lot of extra weights in cases such as when a large MSet size is requested).
comment:11 by , 12 years ago
Backported that tweaked version of positional-query-weight-check-first.patch for 1.2.14 in r17079.
comment:12 by , 12 years ago
Gustavo Yoshizaki from Avature kindly provided the following benchmark results, and agreed to let me make them public:
These are the four configurations tested: A = No optimization B = Pond size optimization C = Phrase weight check optimization D = Both optimizations Average time in seconds to make more than 4000 searches (the same searches with the same dataset) A: 690.95 B: 318.13 C: 137.99 D: 124.88 Number of times in which the configuration was better than the other three configuration A: 464 B: 728 C: 1489 D: 2183 For the searches where configurations B, C and D is faster than configuration A, the % of speed up B: 62,94 % C: 59,58 % D: 57,87 %
So C is the patch which is in 1.2.14 and will be in 1.3.1. B is the original pond size patch, I believe with the default pond size.
The tl;dr version is that the patch now applied makes significantly more difference alone than the "pond" patch alone does (at least for these searches), but using both together would improve things further still. With the default pond size, by 10% on average, but I bet that's not the optimal size as it was just picked arbitrarily.
comment:13 by , 10 years ago
Milestone: | 1.3.x → 1.4.x |
---|
Shouldn't block 1.4.0.
There have been a number of phrase-related enhancements, so this really needs rechecking to see what effect the pond now has, and to see what pond size works best.
comment:14 by , 9 years ago
Over the last 6 months, the glass backend has seen several changes which reduce the overhead per item in the table, which particularly helps the position table as it has a huge number of entries, many of which are really small.
Since glass will be the default backend in 1.4.0, this is another reason we need fresh benchmarks to decide if the settling pond is still worth pursuing.
follow-up: 17 comment:15 by , 5 years ago
Replying to olly:
It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches.
This is no longer the case for git master - remote matches are now handled by merging MSet objects rather than reinjecting their contents back into a local matcher loop.
And it'll mean we don't have to complicate the matcher main loop further, which can only be a good thing.
This is certainly still a valid concern.
comment:16 by , 5 years ago
Milestone: | 1.4.x → 1.5.0 |
---|---|
Priority: | high → normal |
Version: | 1.1.2 → git master |
1.4.13 can combine the and-like parts of AND_MAYBE and AND_NOT with other and-like queries which means phrase checks can get hoisted higher in some cases. And the "maybe" part of AND_MAYBE gets hoisted above the phrase check so we can avoid visiting it if the phrase doesn't match.
We really should re-evaluate the settling pond idea in the light of the large number of changes in this area and either implement some version of it or close this ticket. Marking for 1.5.0 for now.
comment:17 by , 5 years ago
Replying to Olly Betts:
Replying to olly:
It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches.
This is no longer the case for git master - remote matches are now handled by merging MSet objects rather than reinjecting their contents back into a local matcher loop.
However, the (new on master) PostListTree
class which handles the local shards processes them one by one but their docids are interleaved, so it returns docids in non-ascending order in the general case, so I think we can insert the settling pond either just above or just below that and keep the design nice and clean. I think just below is probably the right place since otherwise we'd need to keep track of which PostList
each entry in the pond corresponds to. This would mean the pond had to be drained between processing shards, but that's not unreasonable.
comment:18 by , 2 years ago
I found a somewhat newer version of the patch from 2013 on my dev box so attaching in case it's useful. I think this is probably the version of the patch Gustavo tested.
It would be good to put this ticket to bed, either by concluding that the idea is no longer useful in the light of the numerous improvements to phrase search performance we've made in the meantime, or by merging an updated implementation of the idea.
Looking over the 2013 patch, some things come to mind.
- It's a shame that it only helps when the positional check is "and-like with the root" - e.g.
"some phrase" OR "other words"
isn't helped at all. I can see that a subquery could read ahead a bit and have its own pond, but it gets more complicated with things likecount_matching_subqs()
needing to work for the "current" entry and I'm not sure how feasible it would actually be. Probably best to retry the idea at the top level and if it helps in that case think harder about whether it's doable for a subquery.
- It could probably still use
read_position_list()
instead of creating newPositionList
objects every time, with a little rejigging so that the docid to open for is specifiable. That would reduce memory allocation overhead and mean that we might not have to deal in the term names.
- The data structure for the pond could be improved. Comments in the patch suggest a min/max heap. Essentially we need to be able to efficiently add items, find and remove the entry with the highest weight, and remove all entries below a specified weight; the data structure has a fixed max number of entries which is known in advance
by , 2 years ago
Attachment: | phrase-settling-pond-update-20130517.patch added |
---|
2013 version of pond patch
comment:19 by , 2 years ago
Hmm, phrase-settling-pond-update-20120913.patch
switched pool_terms
to be vector<PostList*>
but the apparently newer 2013 patch is back to vector<std::string>
, so I think the 20120913 version may be the best starting point for further work.
comment:20 by , 12 months ago
Milestone: | 1.5.0 → 2.0.0 |
---|
Postponing - this optimisation could be added in a stable release so doesn't need to block starting a new release series.
Prototype patch