Opened 16 years ago

Last modified 7 years ago

#304 new enhancement

Calculate LHS weight earlier in AND_MAYBE to avoid calling RHS in some cases

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone:
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Currently, the next() operation for AND_MAYBE is implemented such that the LHS is advanced to its next position, and then the RHS is skipped forward to that position. Each of these moves is supplied with a minweight of the difference between the minweight supplied to the AndMaybePostList::next() call and the maxweight of the other side.

However, once the LHS has settled on a position, we could instead ask the LHS for its weight, and pass a tighter minweight value to the RHS based on the actual weight of the LHS, rather than its maxweight. This tighter bound would only be valid for the actual document being supplied to the skip_to(), so we'd need to use a new method on postlists to support this.

The same kind of optimisation might be applicable in other situations; the problem is that we need to know when its worth calculating the weight for the postlist we've just moved (which may involve a disk access to get the document length, for example), rather than trying to move the RHS.

Change History (3)

comment:1 by Olly Betts, 15 years ago

Since the tighter bound only applies to the current docid, we can't do most of our optimisations based on it.

This new "skip_to2" method seems to have the semantics:

skip_to2(did, min_wt) advances to docid did if that document is in this posting list and has weight at least min_wt. Otherwise it advances to the first document in the posting list with document id > did.

We can't advance beyond the document after did since that document might require a lower minimum weight contribution than min_wt.

What optimisation(s) does this actually allow? I'm failing to see much gain here...

comment:2 by Richard Boulton, 15 years ago

My thinking was that this optimisation would be helpful in situations where tight bounds on the weight of the RHS for parts of the posting list, are known. For example, if we kept wdf bounds for each chunk of the posting source, it would be possible to avoid having to check the existence of a posting if the weight was below the bound for the relevant chunk of the posting source.

Also, if the RHS was a phrase operator, it is possible to determine the weight of the phrase before checking if the phrase actually existed - this could avoid needing to access the positional information in some cases.

The new method would actually be a check2() method, rather than a skip_to2() method, I think - if the document-specific minweight supplied to it was above the known maxweight for the chunk of the posting list, the check2() method would simply return false, to indicate that the document is known to not be weighted high enough.

comment:3 by Olly Betts, 7 years ago

In [b290278a85e7d4c953450a499cdf7e64f6172574] I reimplemented AndMaybePostList. In the new implementation, next() calls check() on the RHS instead of skip_to(), so for a phrase RHS we now only check if the RHS matches the current docid from the LHS - we won't keep advancing the RHS if it doesn't match there.

That doesn't render this ticket obsolete though - a tighter bound for the RHS which is only correct for current document is potentially useful for allowing the RHS to give a short-cut "no" early, such as in this situation:

Also, if the RHS was a phrase operator, it is possible to determine the weight of the phrase before checking if the phrase actually existed - this could avoid needing to access the positional information in some cases.

There are other cases too - an OR subquery on the RHS could determine that it can't match with enough weight for a given document after seeing that a subset of its subqueries don't match it.

And indeed this probably would make more sense as a variant on check() rather than skip_to() (or perhaps an extra parameter on check().

Note: See TracTickets for help on using tickets.