Opened 16 years ago

Last modified 13 months ago

#215 new enhancement

Boolean OR could be optimised further

Reported by: Richard Boulton Owned by: Vaibhav Kansagara
Priority: normal Milestone: 2.0.0
Component: Matcher Version: git master
Severity: minor Keywords:
Cc: harisaicharan111.challa@…, vaibhavkansagara249@… Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

For a boolean OR of several terms, we don't care how many of the terms match - we just care whether any of them do. Therefore, when we find that one of the terms matches, we shouldn't waste effort trying to move the other posting lists to the same position.

Instead, we should hold the posting lists in order of termfrequency, highest termfrequency first. To perform a skip_to() operation with ID "minid", we'd call skip_to(minid) on each of the sub-postlists until one of them moved to minid. At that point, there would be no need to call skip_t() on the remaining sub-postlists.

We'd need to keep track of which sub-postlists have been moved up to the current position, and which haven't. When next() is called, we'd call next() on any sub-postlists which are up-to-date, but we would need to call skip_to() on any other sub-postlists which are further behind.

We could implement this by introducing special BooleanOrPostingList to handle this particular case.

I'm not sure that it would lead to much improvement for common cases, but it is easy to make up cases where it would make a huge improvement (eg, a boolean OR in which one of the terms is a MatchAll term - none of the other postlists would need to be moved in this case.)

Change History (15)

comment:1 by Olly Betts, 16 years ago

Severity: normalenhancement

I don't see a common case this helps (the only "big OR" boolean filters I can think of have disjoint terms - e.g. Ttext/html OR Ttext/pdf, Omega's date range searches) but it would be easy to implement in the queryoptimiser framework so it's worth trying at some point.

comment:2 by Olly Betts, 16 years ago

Operating System: All
Status: newassigned

comment:4 by Olly Betts, 16 years ago

Description: modified (diff)
Owner: changed from New Bugs to Olly Betts
Status: assignednew

Hmm, if weights are coming from an external source, then this becomes more interesting...

comment:5 by Richard Boulton, 16 years ago

PostingSource has a get_maxweight() method - if this returns 0, or if the PostingSource is under a "OP_MULT_WEIGHT 0" (or the boolean part of an OP_FILTER), we can ignore the weights returned by it. I don't see why it would be any harder to do this with weights from an external source, therefore.

comment:6 by Olly Betts, 16 years ago

Sorry, I phrased that badly. I meant to say:

In the case of a boolean OR with weight information coming purely from an external source, then this optimisation becomes potentially much more profitable.

comment:7 by Olly Betts, 9 years ago

Milestone: 1.4.x

This is a purely internal change, so no API or ABI repercussions, so marking for 1.4.x.

comment:8 by charan, 7 years ago

Cc: harisaicharan111.challa@… added

i want to work on this bug

comment:9 by Olly Betts, 7 years ago

Owner: changed from Olly Betts to charan

Sure.

comment:10 by Olly Betts, 5 years ago

754f6bd1ea5c6cd48d692b59296f980920b01df9 added a BoolOrPostList class, but it doesn't currently use the optimisation proposed here.

comment:11 by Vaibhav Kansagara, 5 years ago

Owner: changed from charan to Vaibhav Kansagara

comment:12 by Vaibhav Kansagara, 5 years ago

Cc: vaibhavkansagara249@… added

comment:13 by Olly Betts, 4 years ago

A somewhat common case where we get a potentially large boolean OR is for a wildcard weighted by OP_SYNONYM (the default).

comment:14 by Olly Betts, 2 years ago

A somewhat common case where we get a potentially large boolean OR is for a wildcard weighted by OP_SYNONYM (the default).

Except that in this case we do need all the subpostlists to be moved to the matching document so that the SynonymPostList can call get_wdf().

comment:15 by Olly Betts, 19 months ago

I'm not sure that it would lead to much improvement for common cases, but it is easy to make up cases where it would make a huge improvement (eg, a boolean OR in which one of the terms is a MatchAll term - none of the other postlists would need to be moved in this case.)

While the optimisation here is worth trying, this particular case would actually be better handled by the query optimiser just reducing it to a MatchAll query. I tested and that doesn't currently happen.

Vaibhav: Are you still interested in working on this? If not, we should unassign.

Last edited 19 months ago by Olly Betts (previous) (diff)

comment:16 by Olly Betts, 13 months ago

Milestone: 1.4.x2.0.0
Version: SVN trunkgit master
Note: See TracTickets for help on using tickets.