Opened 8 years ago

Last modified 20 months ago

#394 assigned enhancement

Speed up phrase queries with a "settling pond"

Reported by: olly Owned by: olly
Priority: high Milestone: 1.4.x
Component: Matcher Version: 1.1.2
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

The attached patch implements a "settling pond" to delay the checking of exact phrases which are "and-like with the root" (by which I mean if the phrase doesn't match, the whole query doesn't match). We discard pond entries which are below the current min_weight, and when the pond fills up, or the postlist tree is done, we take the highest weighted entries from the pond, which makes it more likely we'll increase min_weight and so be able to discard lower weighted pond entries without needing to do the potentially expensive phrase check.

This patch can dramatically improve query speeds for exact phrase queries for common terms when the positional data isn't cached.

This could be extended to any OP_PHRASE or OP_NEAR check which is "and-like with the root", and also to perform more than one such check.

The patch needs cleaning up in a few places, and the pond size should default to something sane (based on DB size perhaps?) but I'm putting it here to make sure it doesn't get lost or forgotten.

Patch is against trunk r13285.

Attachments (6)

phrase-settling-pond.patch (23.3 KB) - added by olly 8 years ago.
Prototype patch
phrase-settling-pond-update.patch (23.5 KB) - added by olly 6 years ago.
Patch updated to SVN trunk r16092
phrase-settling-pond-update-20120911.patch (24.6 KB) - added by olly 5 years ago.
updated for trunk as of 2012-09-11
phrase-settling-pond-update-20120913.patch (46.3 KB) - added by olly 5 years ago.
internally now works with PostList?* objects rather than turning then back into strings
positional-query-weight-check-first.patch (2.7 KB) - added by olly 5 years ago.
simpler optimisation which works in more cases
positional-query-weight-check-first-1.2.patch (2.4 KB) - added by olly 5 years ago.
same patch backported to 1.2

Download all attachments as: .zip

Change History (20)

Changed 8 years ago by olly

Prototype patch

comment:1 Changed 8 years ago by richard

Henry reports on xapian-discuss that:

You may want to seriously consider pushing this patch into trunk since the performance benefit is, well, significant.

My phrase test: from 15s+ down to 1s

comment:2 Changed 7 years ago by olly

  • Priority changed from normal to high

This is a big win for a slow case which several users have hit, so marking as high priority.

comment:3 Changed 7 years ago by olly

  • Milestone set to 1.2.x

Oh, and it isn't an API or ABI change, so -> milestone:1.2.x

comment:4 Changed 6 years ago by olly

  • Milestone changed from 1.2.x to 1.3.0
  • Status changed from new to assigned

Changed 6 years ago by olly

Patch updated to SVN trunk r16092

comment:5 Changed 6 years ago by olly

Patch updated to current SVN trunk. Apart from fixing a comment typo, the changes are just to update it for changes in trunk since the previous patch.

Testsuite is looking good (still running, but it has passed everything so far) but we don't have tests of every matcher feature in combination with a phrase search which is really needed to give us good coverage for this change.

I'd like to finish off and land the query internal reimplementation first, then I'll update this for the changes that brings in, and try to get it merged.

comment:6 Changed 6 years ago by olly

It occurred to me that we should be able to just make this a PostList subclass which the matcher can read from. It will return out-of-order docids, but that's OK as we already handle that for remote matches. And it'll mean we don't have to complicate the matcher main loop further, which can only be a good thing.

comment:7 Changed 6 years ago by olly

  • Milestone changed from 1.3.0 to 1.3.x

Changed 5 years ago by olly

updated for trunk as of 2012-09-11

Changed 5 years ago by olly

internally now works with PostList?* objects rather than turning then back into strings

comment:8 Changed 5 years ago by olly

The latest patch addresses one of the quick-and-dirty aspects of the initial patch.

It passes the testsuite, but I suspect many need attention for the multi-database case (if so test coverage could be improved).

Changed 5 years ago by olly

simpler optimisation which works in more cases

Changed 5 years ago by olly

same patch backported to 1.2

comment:9 Changed 5 years ago by olly

I've had some feedback that the simpler patch is effective, so I'll apply that. That "feedbacker" is going to test both patches separately and together, so I'll report the conclusions if/when I hear back.

comment:10 Changed 5 years ago by olly

OK, positional-query-weight-check-first.patch applied to trunk (as hence will be in 1.3.1), with an additional tweak not to calculate or check the weight when w_min == 0.0 (so we don't pointlessly calculate a lot of extra weights in cases such as when a large MSet size is requested).

comment:11 Changed 5 years ago by olly

Backported that tweaked version of positional-query-weight-check-first.patch for 1.2.14 in r17079.

comment:12 Changed 5 years ago by olly

Gustavo Yoshizaki from Avature kindly provided the following benchmark results, and agreed to let me make them public:

These are the four configurations tested:
A = No optimization
B = Pond size optimization
C = Phrase weight check optimization
D = Both optimizations

Average time in seconds to make more than 4000 searches (the same 
searches with the same dataset)
A:    690.95
B:    318.13
C:    137.99
D:    124.88

Number of times in which the configuration was better than the other 
three configuration
A: 464
B: 728
C: 1489
D: 2183

For the searches where configurations B, C and D is faster than 
configuration A, the % of speed up
B: 62,94 %
C: 59,58 %
D: 57,87 %

So C is the patch which is in 1.2.14 and will be in 1.3.1. B is the original pond size patch, I believe with the default pond size.

The tl;dr version is that the patch now applied makes significantly more difference alone than the "pond" patch alone does (at least for these searches), but using both together would improve things further still. With the default pond size, by 10% on average, but I bet that's not the optimal size as it was just picked arbitrarily.

comment:13 Changed 3 years ago by olly

  • Milestone changed from 1.3.x to 1.4.x

Shouldn't block 1.4.0.

There have been a number of phrase-related enhancements, so this really needs rechecking to see what effect the pond now has, and to see what pond size works best.

comment:14 Changed 20 months ago by olly

Over the last 6 months, the glass backend has seen several changes which reduce the overhead per item in the table, which particularly helps the position table as it has a huge number of entries, many of which are really small.

Since glass will be the default backend in 1.4.0, this is another reason we need fresh benchmarks to decide if the settling pond is still worth pursuing.

Note: See TracTickets for help on using tickets.