| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>Xapian: Matcher Design Notes</TITLE> |
|---|
| 5 | </HEAD> |
|---|
| 6 | <BODY BGCOLOR="white" TEXT="black"> |
|---|
| 7 | |
|---|
| 8 | <H1>Matcher Design Notes</H1> |
|---|
| 9 | |
|---|
| 10 | <P>This document is incomplete at present. It lacks explanation of the min-heap |
|---|
| 11 | used to keep the best N M-set items (Managing Gigabytes describes this |
|---|
| 12 | technique well). |
|---|
| 13 | |
|---|
| 14 | <h2>The PostList Tree</h2> |
|---|
| 15 | |
|---|
| 16 | <P>The matcher builds a tree structure from the query. This tree is a binary |
|---|
| 17 | tree, and each node is a PostList sub-class object. This is more efficient than |
|---|
| 18 | a n-ary tree in terms of the number of comparisons which need to be performed: |
|---|
| 19 | <insert proof> (but this proof may only be valid for equal sized posting |
|---|
| 20 | lists |
|---|
| 21 | without optimisations, in which case there may be a more efficient way to do |
|---|
| 22 | this - investigate!) |
|---|
| 23 | |
|---|
| 24 | <P>To built the tree, a PostList object is |
|---|
| 25 | created for each term, and pairs of PostLists are combined using |
|---|
| 26 | 2-way branching tree elements for AND, OR, etc - these are virtual |
|---|
| 27 | PostLists whose class names reflect the operation (AndPostList, |
|---|
| 28 | OrPostList, etc). See below for a full list. |
|---|
| 29 | |
|---|
| 30 | <P>The tree is deliberately built in an uneven way, such that we minimise the |
|---|
| 31 | likely number of |
|---|
| 32 | times a posting has to be passed up a level. For a group of OR operations, |
|---|
| 33 | the PostLists with fewest entries are furthest down the group's subtree, |
|---|
| 34 | minimising the amount of information needing to be passed up the tree. |
|---|
| 35 | For a group of AND operations the PostLists with most entries are furthest down, |
|---|
| 36 | so we look at the least frequent terms first, and skip the posting lists of |
|---|
| 37 | the others. This will generally minimise the number of posting list entries |
|---|
| 38 | we read and maximises the size of each skip_to. The OR tree is built up |
|---|
| 39 | in a similar way to how an optimal huffman code is constructed. This is |
|---|
| 40 | provably optimal, but with the assumption that the tree structure is |
|---|
| 41 | immutable once created. If term distribution is uneven, rebalancing this |
|---|
| 42 | tree during the match might be more efficient. |
|---|
| 43 | |
|---|
| 44 | <h2>running the match</h2> |
|---|
| 45 | |
|---|
| 46 | <P> |
|---|
| 47 | Once the tree is built, the matcher repeatedly asks |
|---|
| 48 | the root of the tree for the next matching document and compares it to |
|---|
| 49 | those in the proto-mset it maintains. If the next matching document scores |
|---|
| 50 | more highly (either by weight, or in sort order if sorting is used) then it |
|---|
| 51 | adds it and discards the lowest scoring document. |
|---|
| 52 | |
|---|
| 53 | <P> |
|---|
| 54 | When one of a sub-tree of AND operations runs out, it signals "end of list", |
|---|
| 55 | and each AND signals this too. |
|---|
| 56 | |
|---|
| 57 | <P> |
|---|
| 58 | When an OR gets end of list, it autoprunes, replacing itself with the |
|---|
| 59 | branch that still has postings - see below for full details. If the matcher |
|---|
| 60 | itself gets "end of list", the match is complete. |
|---|
| 61 | |
|---|
| 62 | <P> |
|---|
| 63 | The other operations also handle end of list in one of these two ways |
|---|
| 64 | (for asymmetric operations, which happens may depend which branch has run out). |
|---|
| 65 | |
|---|
| 66 | <P> |
|---|
| 67 | The matcher also passes the lowest weight currently needed make the proto-mset |
|---|
| 68 | into the tree, and each node may adjust this weight and pass it on to its |
|---|
| 69 | subtrees. Each PostList can report a minimum weight it could contribute - so if |
|---|
| 70 | the left branch of an AND will always return a weight of 2 or more, |
|---|
| 71 | then if the whole AND needs to return at least 6, the right branch is |
|---|
| 72 | told it need to return at least 4. |
|---|
| 73 | |
|---|
| 74 | <P> |
|---|
| 75 | For example, an OR knows that if its left branch can |
|---|
| 76 | contribute at most a weight of 4 and its right branch at most 7, then if |
|---|
| 77 | the minimum weight is 8, only documents matching both branches are now |
|---|
| 78 | of interest so it mutates into an AND. If the minimum weight is 6 it |
|---|
| 79 | changes into an AND_MAYBE (A AND_MAYBE B matches documents which which |
|---|
| 80 | match A, but B contributes to the weight - in most search engines query |
|---|
| 81 | syntax, that's expressed as `+A B'). See the "Operator Decay" section |
|---|
| 82 | below for full details of these mutations. If the minimum weight needed is |
|---|
| 83 | 12, no document is good enough, and the OR returns "end of list". |
|---|
| 84 | |
|---|
| 85 | <h2>Phrase and near matching</h2> |
|---|
| 86 | |
|---|
| 87 | <P> |
|---|
| 88 | The way phrase and near matching works is to perform an AND query for all the |
|---|
| 89 | terms, with a filter node in front which only returns documents whose |
|---|
| 90 | positional information fulfils the phrase requirements. |
|---|
| 91 | |
|---|
| 92 | <P>Unfortunately this creates a |
|---|
| 93 | bad case is where a lot of documents have the words of the phrase in |
|---|
| 94 | but few match the actual phrase - this filter does a lot of work, but |
|---|
| 95 | the matcher can't stop early. We should look at hoisting the filtering |
|---|
| 96 | part higher up the tree, but note that this may not always be a win. |
|---|
| 97 | Some heuristics are probably required. |
|---|
| 98 | |
|---|
| 99 | <h2>virtual postlist types</h2> |
|---|
| 100 | |
|---|
| 101 | <P>There are several types of virtual PostList. Each type can be treated as |
|---|
| 102 | boolean or probabilistic - the only difference is whether the weights are |
|---|
| 103 | ignored or not. The types are: |
|---|
| 104 | |
|---|
| 105 | <ul> |
|---|
| 106 | <li> OrPostList: returns documents which match either branch |
|---|
| 107 | |
|---|
| 108 | <li> AndPostList: returns documents which match both branches |
|---|
| 109 | |
|---|
| 110 | <li> XorPostList: returns documents which match one branch or the other but not both |
|---|
| 111 | |
|---|
| 112 | <li> AndNotPostList: returns documents which match the left branch, but not the |
|---|
| 113 | right (the weights of documents from the right branch are ignored). |
|---|
| 114 | |
|---|
| 115 | <li> AndMaybePostList: returns documents which match the left branch - weights from |
|---|
| 116 | documents also in the right branch are added in for the probabilistic case |
|---|
| 117 | ("X ANDMAYBE Y" can be expressed as "+X Y" in Altavista). |
|---|
| 118 | |
|---|
| 119 | <li> FilterPostList: applies the right branch as a boolean filter to the left |
|---|
| 120 | branch (which is typically a probabilistic query. Note: same as |
|---|
| 121 | AndPostList with the right branch weights ignored. |
|---|
| 122 | </ul> |
|---|
| 123 | |
|---|
| 124 | <p>[Note: You can use AndNotPostList to apply an inverted boolean filter to a |
|---|
| 125 | probabilistic query] |
|---|
| 126 | |
|---|
| 127 | <p>All the symmetric operators (i.e. OR, AND, XOR) are coding for maximum |
|---|
| 128 | efficiency when the right branch has fewer postings in than the left branch. |
|---|
| 129 | |
|---|
| 130 | <p>There are 2 main optimisations which the best match performs: autoprune and |
|---|
| 131 | operator decay. |
|---|
| 132 | |
|---|
| 133 | <h2>autoprune</h2> |
|---|
| 134 | |
|---|
| 135 | <P>For example, if a branch in the match tree is "A OR B", when A runs out then |
|---|
| 136 | "A OR B" is replaced by "B". Similar reductions occur for XOR, ANDNOT, and |
|---|
| 137 | ANDMAYBE (if the right branch runs out). Other operators (AND, FILTER, and |
|---|
| 138 | ANDMAYBE (when the left branch runs out) simply return "at_end" and this is |
|---|
| 139 | dealt with somewhere further up the tree as appropriate. |
|---|
| 140 | |
|---|
| 141 | <P>An autoprune is indicated by the next or skip_to method returning a pointer |
|---|
| 142 | to the PostList object to replace the postlist being read with. |
|---|
| 143 | |
|---|
| 144 | <h2>operator decay</h2> |
|---|
| 145 | |
|---|
| 146 | <P>The matcher tracks the minimum weight needed for a document to make it into |
|---|
| 147 | the m-set (this decreases monotonically as the m-set forms). This can be |
|---|
| 148 | used to replace on boolean operator with a stricter one. E.g. consider A OR |
|---|
| 149 | B - when maxweight(A) < minweight and maxweight(B) < minweight then only |
|---|
| 150 | documents matching both A and B can make it into the m-set so we can replace |
|---|
| 151 | the OR with an AND. Operator decay is flagged using the same mechanism as |
|---|
| 152 | autoprune, by returning the replacement operator from next or skip_to. |
|---|
| 153 | |
|---|
| 154 | <p>Possible decays: |
|---|
| 155 | |
|---|
| 156 | <ul> |
|---|
| 157 | <li> OR -> AND |
|---|
| 158 | <li> OR -> ANDMAYBE |
|---|
| 159 | <li> ANDMAYBE -> AND |
|---|
| 160 | <li> XOR -> ANDNOT |
|---|
| 161 | </ul> |
|---|
| 162 | |
|---|
| 163 | <p>A related optimisation is that the Match object may terminate early if |
|---|
| 164 | maxweight for the whole tree is less than the smallest weight in the mset. |
|---|
| 165 | |
|---|
| 166 | <!-- FOOTER $Author$ $Date$ $Id$ --> |
|---|
| 167 | </BODY> |
|---|
| 168 | </HTML> |
|---|