Opened 12 years ago

Last modified 4 months ago

#215 new enhancement

Boolean OR could be optimised further

Reported by: richard Owned by: vaibhav
Priority: normal Milestone: 1.4.x
Component: Matcher Version: SVN trunk
Severity: minor Keywords:
Cc: harisaicharan111.challa@…, vaibhavkansagara249@… Blocked By:
Blocking: Operating System: All

Description (last modified by olly)

For a boolean OR of several terms, we don't care how many of the terms match - we just care whether any of them do. Therefore, when we find that one of the terms matches, we shouldn't waste effort trying to move the other posting lists to the same position.

Instead, we should hold the posting lists in order of termfrequency, highest termfrequency first. To perform a skip_to() operation with ID "minid", we'd call skip_to(minid) on each of the sub-postlists until one of them moved to minid. At that point, there would be no need to call skip_t() on the remaining sub-postlists.

We'd need to keep track of which sub-postlists have been moved up to the current position, and which haven't. When next() is called, we'd call next() on any sub-postlists which are up-to-date, but we would need to call skip_to() on any other sub-postlists which are further behind.

We could implement this by introducing special BooleanOrPostingList? to handle this particular case.

I'm not sure that it would lead to much improvement for common cases, but it is easy to make up cases where it would make a huge improvement (eg, a boolean OR in which one of the terms is a MatchAll? term - none of the other postlists would need to be moved in this case.)

Change History (11)

comment:1 Changed 12 years ago by olly

  • Severity changed from normal to enhancement

I don't see a common case this helps (the only "big OR" boolean filters I can think of have disjoint terms - e.g. Ttext/html OR Ttext/pdf, Omega's date range searches) but it would be easy to implement in the queryoptimiser framework so it's worth trying at some point.

comment:2 Changed 12 years ago by olly

  • Operating System set to All
  • Status changed from new to assigned

comment:4 Changed 11 years ago by olly

  • Description modified (diff)
  • Owner changed from newbugs to olly
  • Status changed from assigned to new

Hmm, if weights are coming from an external source, then this becomes more interesting...

comment:5 Changed 11 years ago by richard

PostingSource? has a get_maxweight() method - if this returns 0, or if the PostingSource? is under a "OP_MULT_WEIGHT 0" (or the boolean part of an OP_FILTER), we can ignore the weights returned by it. I don't see why it would be any harder to do this with weights from an external source, therefore.

comment:6 Changed 11 years ago by olly

Sorry, I phrased that badly. I meant to say:

In the case of a boolean OR with weight information coming purely from an external source, then this optimisation becomes potentially much more profitable.

comment:7 Changed 4 years ago by olly

  • Milestone set to 1.4.x

This is a purely internal change, so no API or ABI repercussions, so marking for 1.4.x.

comment:8 Changed 2 years ago by charan

  • Cc harisaicharan111.challa@… added

i want to work on this bug

comment:9 Changed 2 years ago by olly

  • Owner changed from olly to charan

Sure.

comment:10 Changed 6 months ago by olly

754f6bd1ea5c6cd48d692b59296f980920b01df9 added a BoolOrPostList class, but it doesn't currently use the optimisation proposed here.

comment:11 Changed 4 months ago by vaibhav

  • Owner changed from charan to vaibhav

comment:12 Changed 4 months ago by vaibhav

  • Cc vaibhavkansagara249@… added
Note: See TracTickets for help on using tickets.