Opened 10 years ago

Last modified 16 months ago

#378 assigned enhancement

Optimise MultiAndPostList using current weight of lhs

Reported by: richard Owned by: richard
Priority: low Milestone: 1.4.x
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Currently, MultiAndPostList::next() advances the sub-postlists until they all reach a common document ID, by moving the first postlist forward, and then calling check on all the subsequent postlists, with the document ID that first postlist has moved to. The weights are not calculated until after the postlists have all settled on a position (so the w_min value is not used by MultiAndPostList? - though it may be used by the sub-postlists).

Instead, each time the lhs moves to a new position, the current weight of that sub-postlist could be calculated. If this weight is insufficient, when combined with the maxweights of the remaining sub-postlists, to attain w_min(), the lhs could be moved further forward, without calling check() on any of the subsequent posting lists; thus reducing the frequency at which the subsequent posting lists need to be accessed. If the subsequent posting lists are "expensive" to check for some reason (perhaps they're external posting sources, or value ranges), this can speed things up.

This principle could be used in AndMaybePostList? too, and possibly to OrPostList?.

Speculatively, this idea could be extended, such that the skip_to() and check() methods of posting sources would be given an extra "current_min_weight" parameter, indicating the minimum weight required for the specified document id. This would allow the optimisation to be applied outside within the individual sub-postlist trees of a MultiAndPostList?.

I've got a prototype patch to MultiAndPostList?, and some timings for a particular query. In most cases, I don't see a noticeable performance difference with the prototype patch, but in some cases, it results in a significant speed up. On the other hand, (though I've not observed it), it's plausible that this patch could slow things down if calculating the weight on the lhs postlist is more expensive than calling check() on the rhs postlist.

Attachments (1)

patch (741 bytes) - added by richard 10 years ago.
Corrected patch

Download all attachments as: .zip

Change History (11)

Changed 10 years ago by richard

Corrected patch

comment:1 Changed 10 years ago by richard

  • Owner changed from olly to richard
  • Status changed from new to assigned

Best improvement I've seen with this patch: for a search of the form "A AND_MAYBE B AND_MAYBE DecreasingValuePostingSource?()", when requesting 300 hits (there are around 120,000 hits matching the search), this patch more than doubles the speed, and results in 760 calls to check() on the PostingSource?, instead of around 3000. For this particular search, the posting source weights are large enough that the AND_MAYBE decays to a MultiAnd? almost as soon as 300 hits have been seen.

For the same search, when requesting fewer than 200 hits, or more than 1000 hits, there's no easily measurable difference in speed, and the number of calls to check() reduces only by a couple of percent. I've not yet tried it with other searches (though I've tried it with the plain ValuePostingSource?, where it also has little effect).

comment:2 Changed 10 years ago by olly

I've been trying to think of a "worst case" for this idea so we can try it and measure it to see how it does. If it's not measurably worse than the current approach, we can stop worrying about it!

I'm not sure if it is the worst, but I think a bad case is when we have "a" matching 2,4,6,8,10,...,1000000 and "b" matching 1,3,5,7,9,...,1000001. Then a AND b has a on the lhs, and currently we try check() on b for each position in a, but never find a match. With your patch we calculate the weight for each position in a as well, so the overhead is 500000 weight calculations. I suspect that's measurable compared to the 500000 calls to next() on a and check() on b.

Does this seem a plausible testcase? Can you think of a worse case to try?

(Not always being a win doesn't necessarily sink the idea of course, but it means that it needs more careful consideration, and possibly a heuristic for when to apply this optimisation and when not to - for example, we could have a "cost" for each postlist, set higher for ExternalPostList, etc).

comment:3 Changed 10 years ago by olly

It seems to me another bad case is as above, but with "b" being a PostingSource (where check() doesn't leave the list on an entry) which only matches 1 and 1000001.

comment:4 Changed 10 years ago by olly

  • Summary changed from Optimise MultiAndPostList using current weight of lhs. to Optimise MultiAndPostList using current weight of lhs

It seems we need to decide what approach to take with this. Either:

  • We're cautious and try checking the worst cases we can think of (those above, and any others) before applying.
  • We just apply this and deal with any fallout if/when it is noticed, possibly by simply backing out this change at that point.

The later potentially makes trunk less releasable, but this patch doesn't seem likely to become hard to back out in the medium term (there's no API change for a start), so that's not as bad as it might seem.

Thoughts?

comment:5 Changed 10 years ago by richard

In the short term, I'd feel fairly comfortable about applying the patch; though there are definitely cases in which it will reduce performance, the gain in other cases is quite significant. I'm not desperate for it to be applied, though, since in most of the situations I found little change in performance. The uncertainty is probably a good argument for leaving it as it is.

I think that we're reaching the limit of what can be done with matcher optimisations without giving the matcher extra information (such as estimated "costs" for each postlist, as suggested above). If and when we add such things, it should be more obvious which path to take.

The other option is to make a public option on the matcher to select which path to try. Users desperate for low-latency search results, but with plenty of CPU to spare, could then try running the search in parallel using each path, and discard whichever path doesn't complete first.

comment:6 Changed 10 years ago by olly

I think major overhauls of the matcher are out of scope for 1.1.x at this point - we don't want to risk destabilising changes, and if anything we're approaching time to start being brutal and culling some things which currently have milestones set for 1.1.x.

I don't think speculatively executing in parallel is a sane approach for most users, and there's a significant risk that the rival searches would compete for I/O and cache resources resulting in a slower result overall anyway.

Thinking more about this, I feel nervous about making this change with no idea of how bad it could make things, so I think it would be useful to try to quantify what slowdown we might see in the worst case.

Do you still have the large database you were testing this on? If so, it would be good to try one or both of the likely "bad cases" above. I can probably even try to implement the required postingsources if you don't have the time, but I don't think I have a suitable database around to test with.

comment:7 Changed 10 years ago by olly

  • Milestone changed from 1.1.2 to 1.1.3
  • Priority changed from normal to low

Could be done in 1.2.x, but has a patch, so low priority.

The 1.1.2 milestone has gathered more than its share of bugs, so bumping this to 1.1.3 for now in the interests of being able to release 1.1.2 in the not too distant future.

comment:8 Changed 10 years ago by olly

  • Milestone changed from 1.1.3 to 1.2.0

Bumping to stay on track for 1.2.0.

comment:9 Changed 7 years ago by olly

  • Milestone changed from 1.2.x to 1.3.x

This isn't 1.2.x material now.

comment:10 Changed 4 years ago by olly

  • Milestone changed from 1.3.x to 1.4.x

This doesn't require API or ABI changes, so bumping to 1.4.x.

Note: See TracTickets for help on using tickets.