Opened 13 years ago

Last modified 15 months ago

#215 new enhancement

Boolean OR could be optimised further

Reported by: Richard Boulton Owned by: Vaibhav Kansagara
Priority: normal Milestone: 1.4.x
Component: Matcher Version: SVN trunk
Severity: minor Keywords:
Cc: harisaicharan111.challa@…, vaibhavkansagara249@… Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

For a boolean OR of several terms, we don't care how many of the terms match - we just care whether any of them do. Therefore, when we find that one of the terms matches, we shouldn't waste effort trying to move the other posting lists to the same position.

Instead, we should hold the posting lists in order of termfrequency, highest termfrequency first. To perform a skip_to() operation with ID "minid", we'd call skip_to(minid) on each of the sub-postlists until one of them moved to minid. At that point, there would be no need to call skip_t() on the remaining sub-postlists.

We'd need to keep track of which sub-postlists have been moved up to the current position, and which haven't. When next() is called, we'd call next() on any sub-postlists which are up-to-date, but we would need to call skip_to() on any other sub-postlists which are further behind.

We could implement this by introducing special BooleanOrPostingList to handle this particular case.

I'm not sure that it would lead to much improvement for common cases, but it is easy to make up cases where it would make a huge improvement (eg, a boolean OR in which one of the terms is a MatchAll term - none of the other postlists would need to be moved in this case.)

Change History (11)

comment:1 by Olly Betts, 13 years ago

Severity: normalenhancement

I don't see a common case this helps (the only "big OR" boolean filters I can think of have disjoint terms - e.g. Ttext/html OR Ttext/pdf, Omega's date range searches) but it would be easy to implement in the queryoptimiser framework so it's worth trying at some point.

comment:2 by Olly Betts, 13 years ago

Operating System: All
Status: newassigned

comment:4 by Olly Betts, 12 years ago

Description: modified (diff)
Owner: changed from New Bugs to Olly Betts
Status: assignednew

Hmm, if weights are coming from an external source, then this becomes more interesting...

comment:5 by Richard Boulton, 12 years ago

PostingSource has a get_maxweight() method - if this returns 0, or if the PostingSource is under a "OP_MULT_WEIGHT 0" (or the boolean part of an OP_FILTER), we can ignore the weights returned by it. I don't see why it would be any harder to do this with weights from an external source, therefore.

comment:6 by Olly Betts, 12 years ago

Sorry, I phrased that badly. I meant to say:

In the case of a boolean OR with weight information coming purely from an external source, then this optimisation becomes potentially much more profitable.

comment:7 by Olly Betts, 5 years ago

Milestone: 1.4.x

This is a purely internal change, so no API or ABI repercussions, so marking for 1.4.x.

comment:8 by charan, 3 years ago

Cc: harisaicharan111.challa@… added

i want to work on this bug

comment:9 by Olly Betts, 3 years ago

Owner: changed from Olly Betts to charan


comment:10 by Olly Betts, 17 months ago

754f6bd1ea5c6cd48d692b59296f980920b01df9 added a BoolOrPostList class, but it doesn't currently use the optimisation proposed here.

comment:11 by Vaibhav Kansagara, 15 months ago

Owner: changed from charan to Vaibhav Kansagara

comment:12 by Vaibhav Kansagara, 15 months ago

Cc: vaibhavkansagara249@… added
Note: See TracTickets for help on using tickets.