root / tags / 1.0.8 / xapian-core / docs / bm25.html

Revision 9806, 4.7 kB (checked in by olly, 13 months ago)

docs/bm25.html: Improve wording.

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2<HTML>
3<HEAD>
4<TITLE>Xapian: BM25 Weighting Scheme</TITLE>
5</HEAD>
6<BODY BGCOLOR="white" TEXT="black">
7
8<H1>The BM25 Weighting Scheme</H1>
9
10<p>
11This is a technical note about the BM25 weighting scheme, which is the default
12weighting scheme used by Xapian.  Recent TREC tests have shown BM25 to be the
13best of the known probabilistic weighting schemes.  In case
14you're wondering, the BM simply stands for "Best Match".
15</p>
16
17<p>
18We'll follow the evolution from the traditional probabilistic weighting
19scheme through to BM25.
20</p>
21
22<H2>The Traditional Probabilistic Weighting Scheme</H2>
23
24<p>
25In its most general form, the traditional probabilistic term weighting function
26is:
27</p>
28
29<table border=0><tr valign=center>
30<td>
31<tt><center>
32<u>(k<sub>3</sub>+1)q</u><br>(k<sub>3</sub>+q)</center></tt>
33</td>
34<td><tt>&middot;</tt></td>
35<td>
36<tt><center>
37<u>(k<sub>1</sub>+1)f</u><br>(k<sub>1</sub>L+f)</center></tt>
38</td>
39<td><tt>&middot;log</tt></td>
40<td><tt><center>
41<u>(r+0.5)(N-n-R+r+0.5)</u><br>(n-r+0.5)(R-r+0.5)</center></tt>
42</td>
43<td>
44<tt>
45&nbsp;...(1)
46</tt>
47</td>
48</tr></table>
49<p>where:</p>
50
51<p>
52<tt>k<sub>1</sub>, k<sub>3</sub> are constants.<br>
53q is the wqf, the within query frequency,<br>
54f is the wdf, the within document frequency,<br>
55n is the number of documents in the collection indexed by this term,<br>
56N is the total number of documents in the collection,<br>
57r is the number of relevant documents indexed by this term,<br>
58R is the total number of relevant documents,<br>
59L is the normalised document length (i.e. the length of this document
60divided by the average length of documents in the collection).
61</tt>
62</p>
63
64<p>
65The factors <tt>(k<sub>3</sub> + 1)</tt> and <tt>(k<sub>1</sub> + 1)</tt>
66are unnecessary here, but help scale the weights, so the first component is 1
67when <tt>q&nbsp;=&nbsp;1</tt> etc. But they are critical
68below when we add an extra item to the sum of term weights.
69</p>
70
71<H2>BM11</H2>
72
73<p>
74Stephen Robertson's BM11 uses formula <tt>(1)</tt> for the term weights, but adds
75an extra item to the sum of term weights to give the overall document
76score:
77</p>
78
79<table border=0><tr valign=center>
80<td><tt>k<sub>2</sub> n<sub>q</sub></tt></td>
81<td>
82<tt><center>
83<u>(1-L)</u><br>(1+L)</center></tt>
84</td>
85<td>
86<tt>
87&nbsp;...(2)
88</tt>
89</td>
90</tr></table>
91
92<p>where:</p>
93
94<p>
95<tt>n<sub>q</sub> is the number of terms in the query (the query length),<br>
96k<sub>2</sub> is yet another constant.
97</tt>
98</p>
99
100<p>
101Note that this extra item is zero when <tt>L&nbsp;=&nbsp;1</tt>.
102</p>
103
104<H2>BM15</H2>
105
106<p>BM15 is BM11 with the <tt>k<sub>1</sub>+f</tt> in place of
107<tt>k<sub>1</sub>L+f</tt> in <tt>(1)</tt>.</p>
108
109<H2>BM25</H2>
110
111<p>BM25 combines the B11 and B15 with a scaling factor, b, which turns BM15
112into BM11 as it moves from 0 to 1:
113</p>
114
115<table border=0><tr valign=center>
116<td>
117<tt><center>
118<u>(k<sub>3</sub>+1)q</u><br>(k<sub>3</sub>+q)</center></tt>
119</td>
120<td><tt>&middot;</tt></td>
121<td>
122<tt><center>
123<u>(k<sub>1</sub>+1)f</u><br>(K+f)</center></tt>
124</td>
125<td><tt>&middot;log</tt></td>
126<td><tt><center>
127<u>(r+0.5)(N-n-R+r+0.5)</u><br>(n-r+0.5)(R-r+0.5)</center></tt>
128</td>
129<td>
130<tt>
131&nbsp;...(3)
132</tt>
133</td>
134</tr></table>
135
136<p>where:</p>
137
138<p>
139<tt>K = k<sub>1</sub>(bL + (1-b))</tt>
140</p>
141
142<p>
143BM25 originally introduced another constant, as a power to which f and K are
144raised.  However, Stephen remarks that powers other than 1 were
145<i>'not helpful'</i>, and other tests confirm this, so Xapian's implementation
146of BM25 ignores this.
147</p>
148
149<p>
150<tt>(2)</tt> and <tt>(3)</tt> make up BM25, with which Stephen has had so
151much recent success.
152</p>
153
154<p>
155This does all seem somewhat ad-hoc, with so many unknown constants
156in the formula.  But note that with <tt>k<sub>2</sub>&nbsp;=&nbsp;0</tt>
157and <tt>b&nbsp;=&nbsp;1</tt> we get the traditional formula anyway.
158</p>
159
160<p>
161The default parameter values Xapian uses are
162<tt>k<sub>1</sub>&nbsp;=&nbsp;1</tt>,
163<tt>k<sub>2</sub>&nbsp;=&nbsp;0</tt>,
164<tt>k<sub>3</sub>&nbsp;=&nbsp;1</tt>,
165and <tt>b&nbsp;=&nbsp;0.5</tt>.  These are reasonable defaults, but the optimum
166values will vary with both the documents being searched and the type of
167queries, so you may be able to improve the effectiveness of your search system
168by tuning the values of these parameters.
169</p>
170
171<p>
172In Xapian, we also apply a floor to L (0.5 by default) which
173helps stop tiny documents get ridiculously high weights.  And the matcher
174wants the extra item in the sum to be positive, so we add
175<tt>k<sub>2</sub>n<sub>q</sub></tt>
176(constant for a given query) to (2) to give:
177</p>
178
179<table border=0><tr valign=center>
180<td>
181<tt><center>
182<u>2 k<sub>2</sub> n<sub>q</sub></u><br>(1 + L)</center></tt>
183</td>
184<td>
185<tt>
186&nbsp;...(4)
187</tt>
188</td>
189</tr></table>
190
191<!-- FOOTER $Author$ $Date$ $Id$ -->
192</BODY>
193</HTML>
Note: See TracBrowser for help on using the browser.