Opened 15 years ago
Last modified 20 months ago
#433 new enhancement
MatchSpy should allow early termination
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Matcher | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
I've recently come across some situations where it would be nice to have a MatchSpy at the start of a search, but after the MatchSpy has seen a certain set of values, it's easy to tell that the MatchSpy won't benefit significantly from seeing any further values. This is particularly attractive in a database in which searches are being biased by a document-specific weight, such that the documents with a high weight are first, since the inaccuracy of early termination is often less important in this case. Three types of example:
- A MatchSpy which is checking for the existence of a particular flag in any of the matching documents. (eg, checking to see if any of the matching documents are products which are "special offers"). Once any such document has been seen, the MatchSpy could be discarded.
- A MatchSpy which is checking for facets, but for which only approximate results and no counts are required, could terminate after seeing a specific maximum number of results (say, 1000). We could just set a checkatleast value of 1000 for the search, except that there may be other aspects of the search causing a higher checkatleast value to be required (for example, other facets for which a more exact count is required, or a MatchSpy like that in the example above checking for special offers. Such a MatchSpy needs to see all potential matches, but may terminate at any time.).
- A MatchSpy may only be interested in the documents with the top N weights. For example, when using a diversity reweighting algorithm, it's useful to know the categories of the top N weighted documents in each of several categories (this is similar to a collapse, but we want to ensure that if there are sufficient matching documents, each of the categories has a full complement of N matches found). Such a MatchSpy could report the lowest weight of document that it's currently interested in to the matcher, potentially allowing the min_weight parameter passed to the posting list tree to be increased, rather than us simply passing checkatleast=doccount as is currently needed to get this guarantee.
Currently, such matchspies can simply stop doing any work after they've seen enough information, but it would be more efficient if they could tell the matcher that they are finished, so that the matcher could terminate early.
Several possible implementations spring to mind:
- Change
MatchSpy::operator()
to return a bool; interpret a "true" value as "I'm finished". This would allow case 1 above to be handled nicely. Case 2 could also be handled in this way, but would require an appropriate checkatleast value to be set too. - Change
MatchSpy::operator()
to return a double, indicating the minimum weight that the MatchSpy is now interested in. We could interpret a value ofDBL_MAX
as "I'm finished". This would support case 1 and 2 as before, but also support case 3. If we combine this with the min_weight value used by the existing matcher code, we could also remove the need to set a high checkatleast value for matchspies which need to see lots of documents - the matchspy would simply return 0.0 until it has seen enough documents, at which point it could return a high value instead. - As for implementation (b), but in addition allow MatchSpies to remove themselves from the matcher by a less hacky means than returning
DBL_MAX
. One approach could be to use a similar approach to PostingSources (of registering the source with the matcher using a method on the base class); add a method on the base MatchSpy class to tell the matcher that the matchspy can be removed. Actually, this approach could be used instead of a return value from !MatchSpy::operator() for setting a new minimum weight, too, but it seems likely to be less lightweight than using the return value.
Marking for 1.1.7, since implementing these might require API changes.
Attachments (1)
Change History (10)
comment:1 by , 15 years ago
by , 15 years ago
Attachment: | matchspy_api_changes_experiment.patch added |
---|
Rough sketch of some of the API ideas I was playing with. Untested - may not compile.
comment:2 by , 15 years ago
I think the time has probably passed for trying to cram further new features in before 1.2.0. We should be stabilising the code at this point...
comment:3 by , 15 years ago
Fair enough - that's a vote for implementation style (c), then! :)
(Since that one doesn't require an API change.)
comment:4 by , 15 years ago
(Ignoring the smiley) it's not a vote for any implementation approach. It's a vote for focussing on getting 1.2.0 released, and in particular not being distracted by new features or enhancing existing features when we should be addressing bug reports from users testing 1.1.x and trunk, plus the few existing tickets we've previously agreed should be addressed.
We were talking (albeit rather speculatively) about July 2009 for 1.2.0 at one point, so we're now more than 6 months late.
This particular ticket is at worst an efficiency issue (since the MatchSpy can as you say just do nothing when called). It would be nice to address for 1.2.0 to avoid a possible API change in the future, but we've already punted on tickets which would be at least as useful to address. It isn't obvious what the best approach here is. If we rush to fix it, we increase the risk of being stuck with an API which may later prove problematic. We also risk introducing bugs and further delaying 1.2.0.
Marking MatchSpy as experimental is a possible option, but we don't really want to be making ABI-changes mid-1.2.x as we'll have to carry the weight of both old and new implementations until 1.3.0.
comment:5 by , 15 years ago
Milestone: | 1.1.7 → 1.2.x |
---|
Agreed - I wanted to get this ticket into the tracker now because I've had several conversations along these lines (with a variety of people) over the last several months, and wanted to have a place to refer to when the subject comes up again. I'm not in any particular hurry to get it into Xapian. Marking as 1.2.x for now is fine by me (and if the API needs to be different, it could wait until 1.3.x).
comment:6 by , 13 years ago
Milestone: | 1.2.x → 1.3.x |
---|
Marking for 1.3.x - if we want to make an API change here, early 1.3.x is the time to do it, so we should think about this ticket in the nearish future.
We could add a second virtual operator() with different parameters which gets called by 1.3.x, and a default implementation of this which calls the existing operator() and returns the "don't change anything" value. Then existing MatchSpy subclasses will still compile and work, though with an extra virtual call overhead.
comment:7 by , 10 years ago
Description: | modified (diff) |
---|
comment:9 by , 20 months ago
Milestone: | 1.4.x → 2.0.0 |
---|---|
Version: | SVN trunk → git master |
I've just been playing with potential implementations for this.
Implementation (a) is quite easy to do; there don't seem to be any unexpected complications.
Implementation (b) gets tricky. The natural user assumption, if a matchspy were able to return a minimum weight required, would be that it would be shown all matching documents which have a weight higher than that weight. Currently, the matchspy is just "along for the ride" with the rest of the matcher, so may well not be shown documents even if they have a higher weight than that requested: the only way to avoid this currently is to set a high enough checkatleast value.
If taking the implementation (b) approach, the returned weight is therefore only useful if it is able to override the minweight calculated by the normal matching code, if the matchspy's minweight is lower. In turn, if we changed the behaviour so that that happened, there'd be no way for a matchspy to indicate that it doesn't want to control the documents looked at by the matcher (ie, there'd be no way to obtain the current "along for the ride" behaviour).
My current feeling is that the best approach to get all the features mentioned above would be to leave !MatchSpy::operator() returning void, and add some extra methods to the base class to allow communication with the matcher: in particular, a method to say "I'm finished", a method to indicate the the matchspy wants to see all matching documents returned from the postlist tree, and a method to indicate the minimum weight of document from the postlist tree that the matchspy is interested in. (The latter two methods could quite possibly be combined.)