Opened 16 years ago
Last modified 21 months ago
#378 assigned enhancement
Optimise MultiAndPostList using current weight of lhs
Reported by: | Richard Boulton | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | low | Milestone: | 2.0.0 |
Component: | Matcher | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
Currently, MultiAndPostList::next() advances the sub-postlists until they all reach a common document ID: it moves the first postlist forward, and then calls check() on each of the subsequent postlists with the document ID that the first postlist has moved to. The weights are not calculated until after the postlists have all settled on a position (so the w_min value is not used by MultiAndPostList itself, though it may be used by the sub-postlists).
Instead, each time the lhs moves to a new position, the current weight of that sub-postlist could be calculated. If this weight, combined with the maxweights of the remaining sub-postlists, is insufficient to attain w_min(), the lhs could be moved further forward without calling check() on any of the subsequent postlists, reducing how often those postlists need to be accessed. If the subsequent postlists are "expensive" to check for some reason (perhaps they're external posting sources, or value ranges), this can speed things up.
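To make the mechanics concrete, here's a minimal self-contained sketch of the pruning step (the SubList interface and find_next_match() below are illustrative stand-ins for the matcher internals, not the actual Xapian classes, and check() is simplified to a plain membership test):

```cpp
#include <vector>
#include <cstddef>

// Illustrative stand-in for a sub-postlist: just enough interface for the sketch.
struct SubList {
    virtual ~SubList() {}
    virtual unsigned next() = 0;              // advance; return new docid (0 => at end)
    virtual bool check(unsigned did) = 0;     // does this sub-list match did?
    virtual double get_weight() const = 0;    // weight at the current position
    virtual double get_maxweight() const = 0; // upper bound on any weight from this list
};

// AND matching loop with the proposed pruning: after advancing the lhs
// (sub[0]), only call check() on the other sub-lists if the lhs weight plus
// the other lists' maxweights could still reach w_min.
unsigned find_next_match(std::vector<SubList*>& sub, double w_min) {
    double tail_max = 0.0;
    for (std::size_t i = 1; i < sub.size(); ++i)
        tail_max += sub[i]->get_maxweight();

    while (true) {
        unsigned did = sub[0]->next();
        if (did == 0) return 0;               // lhs exhausted: no more matches

        // Proposed optimisation: prune on the lhs weight alone.
        if (sub[0]->get_weight() + tail_max < w_min)
            continue;                         // can't reach w_min even with perfect rhs scores

        bool all_match = true;
        for (std::size_t i = 1; i < sub.size(); ++i) {
            if (!sub[i]->check(did)) { all_match = false; break; }
        }
        if (all_match) return did;
    }
}
```

In the real matcher the book-keeping around check() is more involved (it may leave a sub-postlist positioned elsewhere, or not on an entry at all), but the extra work per lhs position is just the get_weight() call and the comparison above, which is the cost/benefit trade-off discussed below.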
This principle could be used in AndMaybePostList too, and possibly in OrPostList.
Speculatively, this idea could be extended such that the skip_to() and check() methods of posting sources would be given an extra "current_min_weight" parameter, indicating the minimum weight required for the specified document ID. This would allow the optimisation to be applied not just by MultiAndPostList itself, but within its individual sub-postlist trees.
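Purely to make that suggestion concrete, the extended entry points might look roughly like the following (the names here, including current_min_weight, are made up for this sketch and shouldn't be read as the existing PostingSource API):

```cpp
// Hypothetical interface sketch for the "current_min_weight" idea above.
struct WeightedSubList {
    virtual ~WeightedSubList() {}
    // Move to the first docid >= did.  current_min_weight is the smallest
    // weight this sub-list would need to contribute at that docid for the
    // whole document to be able to reach w_min, so the sub-list can keep
    // skipping past positions which can't contribute enough.  Returns the
    // docid settled on (0 => at end).
    virtual unsigned skip_to(unsigned did, double current_min_weight) = 0;
    // As skip_to(), but only reports whether did could match; the sub-list
    // need not be left positioned on an entry.
    virtual bool check(unsigned did, double current_min_weight) = 0;
};
```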
I've got a prototype patch to MultiAndPostList, and some timings for a particular query. In most cases I don't see a noticeable performance difference with the prototype patch, but in some cases it results in a significant speed up. On the other hand (though I've not observed it), it's plausible that this patch could slow things down if calculating the weight on the lhs postlist is more expensive than calling check() on the rhs postlist.
Attachments (1)
Change History (12)
by , 16 years ago
comment:1 by , 16 years ago
Owner: | changed |
---|---|
Status: | new → assigned |
Best improvement I've seen with this patch: for a search of the form "A AND_MAYBE B AND_MAYBE DecreasingValuePostingSource()", when requesting 300 hits (there are around 120,000 hits matching the search), this patch more than doubles the speed, and results in 760 calls to check() on the PostingSource, instead of around 3000. For this particular search, the posting source weights are large enough that the AND_MAYBE decays to a MultiAnd almost as soon as 300 hits have been seen.
For the same search, when requesting fewer than 200 hits, or more than 1000 hits, there's no easily measurable difference in speed, and the number of calls to check() reduces only by a couple of percent. I've not yet tried it with other searches (though I've tried it with the plain ValuePostingSource, where it also has little effect).
comment:2 by , 16 years ago
I've been trying to think of a "worst case" for this idea so we can try it and measure it to see how it does. If it's not measurably worse than the current approach, we can stop worrying about it!
I'm not sure if it is the worst, but I think a bad case is when we have "a" matching 2,4,6,8,10,...,1000000 and "b" matching 1,3,5,7,9,...,1000001. Then a AND b has a on the lhs, and currently we try check() on b for each position in a, but never find a match. With your patch we calculate the weight for each position in a as well, so the overhead is 500000 weight calculations. I suspect that's measurable compared to the 500000 calls to next() on a and check() on b.
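A rough sketch of how that database could be built for measurement (assuming the Xapian C++ API; the database path and term names are arbitrary):

```cpp
#include <xapian.h>

int main() {
    Xapian::WritableDatabase db("badcase.db", Xapian::DB_CREATE_OR_OVERWRITE);
    // Docids are assigned 1, 2, 3, ... in order of addition, so alternating
    // the terms puts "b" on the odd docids (1, 3, ..., 1000001) and "a" on
    // the even ones (2, 4, ..., 1000000), matching the case described above.
    for (unsigned did = 1; did <= 1000001; ++did) {
        Xapian::Document doc;
        doc.add_term(did % 2 ? "b" : "a");
        db.add_document(doc);
    }
    db.commit();
    return 0;
}
```

Searching for "a AND b" on that database should give the 500000 next() and check() calls described above to time against.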
Does this seem a plausible testcase? Can you think of a worse case to try?
(Not always being a win doesn't necessarily sink the idea of course, but it means that it needs more careful consideration, and possibly a heuristic for when to apply this optimisation and when not to - for example, we could have a "cost" for each postlist, set higher for ExternalPostList, etc).
comment:3 by , 16 years ago
It seems to me another bad case is as above, but with "b" being a PostingSource (where check() doesn't leave the list on an entry) which only matches 1 and 1000001.
comment:4 by , 15 years ago
Summary: | Optimise MultiAndPostList using current weight of lhs. → Optimise MultiAndPostList using current weight of lhs |
---|---|
It seems we need to decide what approach to take with this. Either:
- We're cautious and try checking the worst cases we can think of (those above, and any others) before applying.
- We just apply this and deal with any fallout if/when it is noticed, possibly by simply backing out this change at that point.
The latter potentially makes trunk less releasable, but this patch doesn't seem likely to become hard to back out in the medium term (there's no API change, for a start), so that's not as bad as it might seem.
Thoughts?
comment:5 by , 15 years ago
In the short term, I'd feel fairly comfortable about applying the patch; though there are definitely cases in which it will reduce performance, the gain in other cases is quite significant. I'm not desperate for it to be applied, though, since in most of the situations I found little change in performance. The uncertainty is probably a good argument for leaving it as it is.
I think that we're reaching the limit of what can be done with matcher optimisations without giving the matcher extra information (such as estimated "costs" for each postlist, as suggested above). If and when we add such things, it should be more obvious which path to take.
The other option is to add a public setting on the matcher to select which path to use. Users desperate for low-latency search results, but with plenty of CPU to spare, could then run the search in parallel using each path and discard whichever doesn't complete first.
comment:6 by , 15 years ago
I think major overhauls of the matcher are out of scope for 1.1.x at this point - we don't want to risk destabilising changes, and if anything we're approaching time to start being brutal and culling some things which currently have milestones set for 1.1.x.
I don't think speculatively executing in parallel is a sane approach for most users, and there's a significant risk that the rival searches would compete for I/O and cache resources resulting in a slower result overall anyway.
Thinking more about this, I feel nervous about making this change with no idea of how bad it could make things, so I think it would be useful to try to quantify what slowdown we might see in the worst case.
Do you still have the large database you were testing this on? If so, it would be good to try one or both of the likely "bad cases" above. I can probably even try to implement the required postingsources if you don't have the time, but I don't think I have a suitable database around to test with.
comment:7 by , 15 years ago
Milestone: | 1.1.2 → 1.1.3 |
---|---|
Priority: | normal → low |
Could be done in 1.2.x, but has a patch, so low priority.
The 1.1.2 milestone has gathered more than its share of bugs, so bumping this to 1.1.3 for now in the interests of being able to release 1.1.2 in the not too distant future.
comment:10 by , 10 years ago
Milestone: | 1.3.x → 1.4.x |
---|---|
This doesn't require API or ABI changes, so bumping to 1.4.x.
comment:11 by , 21 months ago
Milestone: | 1.4.x → 2.0.0 |
---|---|
Version: | SVN trunk → git master |
I'm going to bump the milestone.
This really needs some profiling to determine whether this helps or hinders overall (and in particular in cases we're likely to care about).
Richard notes some results above, but the matcher has been rewritten since so even those can't really be relied on as still applicable.
The patch here should apply to AndPostList::find_next_match() in matcher/andpostlist.cc without much effort.
The thought above about some sort of cost metrics is probably worth exploring too - we now have a query optimiser, and that could probably do a better job if it knew what was likely to be expensive.
Attachment: Corrected patch