root / tags / 1.0.8 / xapian-core / docs / matcherdesign.html

Revision 8160, 7.0 kB (checked in by olly, 21 months ago)

* ./: svn:eol-style not svn:eolstyle.

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2<HTML>
3<HEAD>
4<TITLE>Xapian: Matcher Design Notes</TITLE>
5</HEAD>
6<BODY BGCOLOR="white" TEXT="black">
7
8<H1>Matcher Design Notes</H1>
9
10<P>This document is incomplete at present.  It lacks explanation of the min-heap
11used to keep the best N M-set items (Managing Gigabytes describes this
12technique well).
13
14<h2>The PostList Tree</h2>
15
16<P>The matcher builds a tree structure from the query.  This tree is a binary
17tree, and each node is a PostList sub-class object.  This is more efficient than
18a n-ary tree in terms of the number of comparisons which need to be performed:
19&lt;insert proof&gt; (but this proof may only be valid for equal sized posting
20lists
21without optimisations, in which case there may be a more efficient way to do
22this - investigate!)
23
24<P>To built the tree, a PostList object is
25created for each term, and pairs of PostLists are combined using
262-way branching tree elements for AND, OR, etc - these are virtual
27PostLists whose class names reflect the operation (AndPostList,
28OrPostList, etc).  See below for a full list.
29
30<P>The tree is deliberately built in an uneven way, such that we minimise the
31likely number of
32times a posting has to be passed up a level.  For a group of OR operations,
33the PostLists with fewest entries are furthest down the group's subtree,
34minimising the amount of information needing to be passed up the tree.
35For a group of AND operations the PostLists with most entries are furthest down,
36so we look at the least frequent terms first, and skip the posting lists of
37the others.  This will generally minimise the number of posting list entries
38we read and maximises the size of each skip_to.  The OR tree is built up
39in a similar way to how an optimal huffman code is constructed.  This is
40provably optimal, but with the assumption that the tree structure is
41immutable once created.  If term distribution is uneven, rebalancing this
42tree during the match might be more efficient.
43
44<h2>running the match</h2>
45
46<P>
47Once the tree is built, the matcher repeatedly asks
48the root of the tree for the next matching document and compares it to
49those in the proto-mset it maintains.  If the next matching document scores
50more highly (either by weight, or in sort order if sorting is used) then it
51adds it and discards the lowest scoring document.
52
53<P>
54When one of a sub-tree of AND operations runs out, it signals "end of list",
55and each AND signals this too.
56
57<P>
58When an OR gets end of list, it autoprunes, replacing itself with the
59branch that still has postings - see below for full details.  If the matcher
60itself gets "end of list", the match is complete.
61
62<P>
63The other operations also handle end of list in one of these two ways
64(for asymmetric operations, which happens may depend which branch has run out).
65
66<P>
67The matcher also passes the lowest weight currently needed make the proto-mset
68into the tree, and each node may adjust this weight and pass it on to its
69subtrees.  Each PostList can report a minimum weight it could contribute - so if
70the left branch of an AND will always return a weight of 2 or more,
71then if the whole AND needs to return at least 6, the right branch is
72told it need to return at least 4.
73
74<P>
75For example, an OR knows that if its left branch can
76contribute at most a weight of 4 and its right branch at most 7, then if
77the minimum weight is 8, only documents matching both branches are now
78of interest so it mutates into an AND.  If the minimum weight is 6 it
79changes into an AND_MAYBE (A AND_MAYBE B matches documents which which
80match A, but B contributes to the weight - in most search engines query
81syntax, that's expressed as `+A B').  See the "Operator Decay" section
82below for full details of these mutations.  If the minimum weight needed is
8312, no document is good enough, and the OR returns "end of list".
84
85<h2>Phrase and near matching</h2>
86
87<P>
88The way phrase and near matching works is to perform an AND query for all the
89terms, with a filter node in front which only returns documents whose
90positional information fulfils the phrase requirements.
91
92<P>Unfortunately this creates a
93bad case is where a lot of documents have the words of the phrase in
94but few match the actual phrase - this filter does a lot of work, but
95the matcher can't stop early.  We should look at hoisting the filtering
96part higher up the tree, but note that this may not always be a win.
97Some heuristics are probably required.
98
99<h2>virtual postlist types</h2>
100
101<P>There are several types of virtual PostList.  Each type can be treated as
102boolean or probabilistic - the only difference is whether the weights are
103ignored or not.  The types are:
104
105<ul>
106<li> OrPostList: returns documents which match either branch
107
108<li> AndPostList: returns documents which match both branches
109
110<li> XorPostList: returns documents which match one branch or the other but not both
111
112<li> AndNotPostList: returns documents which match the left branch, but not the
113 right (the weights of documents from the right branch are ignored).
114
115<li> AndMaybePostList: returns documents which match the left branch - weights from
116 documents also in the right branch are added in for the probabilistic case
117 ("X ANDMAYBE Y" can be expressed as "+X Y" in Altavista).
118
119<li> FilterPostList: applies the right branch as a boolean filter to the left
120 branch (which is typically a probabilistic query.  Note: same as
121 AndPostList with the right branch weights ignored.
122</ul>
123
124<p>[Note: You can use AndNotPostList to apply an inverted boolean filter to a
125probabilistic query]
126
127<p>All the symmetric operators (i.e. OR, AND, XOR) are coding for maximum
128efficiency when the right branch has fewer postings in than the left branch.
129
130<p>There are 2 main optimisations which the best match performs: autoprune and
131operator decay.
132
133<h2>autoprune</h2>
134
135<P>For example, if a branch in the match tree is "A OR B", when A runs out then
136"A OR B" is replaced by "B".  Similar reductions occur for XOR, ANDNOT, and
137ANDMAYBE (if the right branch runs out).  Other operators (AND, FILTER, and
138ANDMAYBE (when the left branch runs out) simply return "at_end" and this is
139dealt with somewhere further up the tree as appropriate.
140
141<P>An autoprune is indicated by the next or skip_to method returning a pointer
142to the PostList object to replace the postlist being read with.
143
144<h2>operator decay</h2>
145
146<P>The matcher tracks the minimum weight needed for a document to make it into
147the m-set (this decreases monotonically as the m-set forms).  This can be
148used to replace on boolean operator with a stricter one.  E.g. consider A OR
149B - when maxweight(A) &lt; minweight and maxweight(B) &lt; minweight then only
150documents matching both A and B can make it into the m-set so we can replace
151the OR with an AND.  Operator decay is flagged using the same mechanism as
152autoprune, by returning the replacement operator from next or skip_to.
153
154<p>Possible decays:
155
156<ul>
157<li> OR -&gt; AND
158<li> OR -&gt; ANDMAYBE
159<li> ANDMAYBE -&gt; AND
160<li> XOR -&gt; ANDNOT
161</ul>
162
163<p>A related optimisation is that the Match object may terminate early if
164maxweight for the whole tree is less than the smallest weight in the mset.
165
166<!-- FOOTER $Author$ $Date$ $Id$ -->
167</BODY>
168</HTML>
Note: See TracBrowser for help on using the browser.