Ticket #215 (new enhancement)
Boolean OR could be optimised further
| Reported by: | richard | Owned by: | olly |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | Matcher | Version: | SVN trunk |
| Severity: | minor | Keywords: | |
| Cc: | | Blocked By: | |
| Operating System: | All | Blocking: | |
Description (last modified by olly)
For a boolean OR of several terms, we don't care how many of the terms match - we just care whether any of them do. Therefore, when we find that one of the terms matches, we shouldn't waste effort trying to move the other posting lists to the same position.
Instead, we should hold the posting lists in order of term frequency, highest term frequency first. To perform a skip_to() operation with ID "minid", we'd call skip_to(minid) on each of the sub-postlists in turn until one of them landed exactly on minid. At that point, there would be no need to call skip_to() on the remaining sub-postlists.
We'd need to keep track of which sub-postlists have been moved up to the current position, and which haven't. When next() is called, we'd call next() on any sub-postlists which are up-to-date, but we would need to call skip_to() on any other sub-postlists which are further behind.
We could implement this by introducing a special BooleanOrPostingList to handle this particular case.
I'm not sure that it would lead to much improvement for common cases, but it is easy to make up cases where it would make a huge improvement (e.g., a boolean OR in which one of the terms is a MatchAll term: none of the other postlists would need to be moved in this case).
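To make the idea concrete, here is a minimal sketch of the lazy OR described above. It is written in Python for brevity rather than in Xapian's C++, and it is not an actual Xapian implementation: `ListPostList`, `LazyBooleanOrPostList`, the `skips` counter, and the convention that docid 0 means "not started" are all assumptions made for illustration. Rather than keeping an explicit set of up-to-date sub-postlists, the sketch treats a sub-postlist as "behind" whenever its current docid is less than the target, and implements next() as skip_to() to the following docid.

```python
import bisect


class ListPostList:
    """Hypothetical stand-in for one term's posting list (sorted docids)."""

    def __init__(self, docids):
        self.docids = sorted(docids)
        self.pos = 0    # index of the current entry
        self.skips = 0  # counts skip_to() calls, purely for illustration

    @property
    def termfreq(self):
        return len(self.docids)

    def at_end(self):
        return self.pos >= len(self.docids)

    def docid(self):
        return self.docids[self.pos]

    def skip_to(self, target):
        """Advance to the first docid >= target; never moves backwards."""
        self.skips += 1
        self.pos = bisect.bisect_left(self.docids, target, self.pos)


class LazyBooleanOrPostList:
    """Boolean OR that stops moving sub-postlists once one matches."""

    def __init__(self, subs):
        # Highest term frequency first: the most likely to match soonest.
        self.subs = sorted(subs, key=lambda pl: pl.termfreq, reverse=True)
        self.did = 0        # assumption: 0 means "not started", docids >= 1
        self.ended = False

    def at_end(self):
        return self.ended

    def docid(self):
        return self.did

    def skip_to(self, target):
        if self.ended or target <= self.did:
            return
        best = None
        for pl in self.subs:
            if not pl.at_end() and pl.docid() < target:
                pl.skip_to(target)  # this sub-postlist was behind
            if pl.at_end():
                continue
            if pl.docid() == target:
                self.did = target   # match found: leave the rest untouched
                return
            if best is None or pl.docid() < best:
                best = pl.docid()
        # No exact match, so every live sub-postlist is now past target.
        if best is None:
            self.ended = True
        else:
            self.did = best

    def next(self):
        self.skip_to(self.did + 1)


# Usage: a match-all-like list ORed with two sparse lists.
match_all = ListPostList(range(1, 101))
rare = ListPostList([10, 50])
rarer = ListPostList([70])
or_pl = LazyBooleanOrPostList([rare, match_all, rarer])
seen = []
for _ in range(100):
    or_pl.next()
    seen.append(or_pl.docid())
```

In this usage example the sparse lists are never moved while the match-all-like list keeps supplying matches, which is the extreme case mentioned above. When no sub-postlist lands exactly on the target, every sub-postlist still has to be consulted, so the saving depends entirely on how often an early sub-postlist matches.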
