Opened 17 years ago
Last modified 20 months ago
#215 new enhancement
Boolean OR could be optimised further
Reported by: | Richard Boulton | Owned by: | Vaibhav Kansagara |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Matcher | Version: | git master |
Severity: | minor | Keywords: | |
Cc: | harisaicharan111.challa@…, vaibhavkansagara249@… | Blocked By: | |
Blocking: | Operating System: | All |
Description (last modified by )
For a boolean OR of several terms, we don't care how many of the terms match - we just care whether any of them do. Therefore, when we find that one of the terms matches, we shouldn't waste effort trying to move the other posting lists to the same position.
Instead, we should hold the posting lists in order of termfrequency, highest termfrequency first. To perform a skip_to() operation with ID "minid", we'd call skip_to(minid) on each of the sub-postlists until one of them moved to minid. At that point, there would be no need to call skip_t() on the remaining sub-postlists.
We'd need to keep track of which sub-postlists have been moved up to the current position, and which haven't. When next() is called, we'd call next() on any sub-postlists which are up-to-date, but we would need to call skip_to() on any other sub-postlists which are further behind.
We could implement this by introducing special BooleanOrPostingList to handle this particular case.
I'm not sure that it would lead to much improvement for common cases, but it is easy to make up cases where it would make a huge improvement (eg, a boolean OR in which one of the terms is a MatchAll term - none of the other postlists would need to be moved in this case.)
Change History (15)
comment:1 by , 17 years ago
Severity: | normal → enhancement |
---|
comment:2 by , 17 years ago
Operating System: | → All |
---|---|
Status: | new → assigned |
comment:4 by , 17 years ago
Description: | modified (diff) |
---|---|
Owner: | changed from | to
Status: | assigned → new |
Hmm, if weights are coming from an external source, then this becomes more interesting...
comment:5 by , 17 years ago
PostingSource has a get_maxweight() method - if this returns 0, or if the PostingSource is under a "OP_MULT_WEIGHT 0" (or the boolean part of an OP_FILTER), we can ignore the weights returned by it. I don't see why it would be any harder to do this with weights from an external source, therefore.
comment:6 by , 17 years ago
Sorry, I phrased that badly. I meant to say:
In the case of a boolean OR with weight information coming purely from an external source, then this optimisation becomes potentially much more profitable.
comment:7 by , 10 years ago
Milestone: | → 1.4.x |
---|
This is a purely internal change, so no API or ABI repercussions, so marking for 1.4.x.
comment:10 by , 6 years ago
754f6bd1ea5c6cd48d692b59296f980920b01df9 added a BoolOrPostList
class, but it doesn't currently use the optimisation proposed here.
comment:11 by , 6 years ago
Owner: | changed from | to
---|
comment:12 by , 6 years ago
Cc: | added |
---|
comment:13 by , 4 years ago
A somewhat common case where we get a potentially large boolean OR is for a wildcard weighted by OP_SYNONYM (the default).
comment:14 by , 2 years ago
A somewhat common case where we get a potentially large boolean OR is for a wildcard weighted by OP_SYNONYM (the default).
Except that in this case we do need all the subpostlists to be moved to the matching document so that the SynonymPostList
can call get_wdf()
.
comment:15 by , 2 years ago
I'm not sure that it would lead to much improvement for common cases, but it is easy to make up cases where it would make a huge improvement (eg, a boolean OR in which one of the terms is a
MatchAll
term - none of the other postlists would need to be moved in this case.)
While the optimisation here is worth trying, this particular case would actually be better handled by the query optimiser just reducing it to a MatchAll
query. I tested and that doesn't currently happen.
Vaibhav: Are you still interested in working on this? If not, we should unassign.
comment:16 by , 20 months ago
Milestone: | 1.4.x → 2.0.0 |
---|---|
Version: | SVN trunk → git master |
I don't see a common case this helps (the only "big OR" boolean filters I can think of have disjoint terms - e.g. Ttext/html OR Ttext/pdf, Omega's date range searches) but it would be easy to implement in the queryoptimiser framework so it's worth trying at some point.