root / tags / 1.0.8 / xapian-core / docs / intro_ir.html

Revision 10789, 30.2 kB (checked in by olly, 6 months ago)

Backport change from trunk:
docs/glossary.rst,docs/intro_ir.html: Improve intro_ir a bit, and
link to the definition of RSet in the glossary.

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2<HTML>
3<HEAD>
4<TITLE>Xapian: Theoretical Background</TITLE>
5</HEAD>
6<BODY BGCOLOR="white">
7<H1>Theoretical Background</H1>
8
9<p>This document aims to provide some theoretical background to Xapian.</p>
10
11<H2>Documents and terms</H2>
12
13In Information Retrieval (IR), the items we are trying to retrieve are called
14<EM>documents</EM>, and each document is described by a collection of <EM>terms</EM>. These
15two words, `document' and `term', are now traditional in the vocabulary of IR,
16and reflect its Library Science origins. Usually a document is thought
17of as a piece of text, most likely in a machine readable form, and a term as
18a word or phrase which helps to describe the document, and which may indeed
19occur one or more times in the document. So a document might be about
20dental care, and could be described by corresponding terms `tooth', `teeth',
21`toothbrush', `decay', `cavity', `plaque', `diet' and so on.
22<P>
23More generally a document can be anything we want to retrieve, and a term any
24feature that helps describe the documents. So the documents could be a
25collection of fossils held in a big museum collection, and the terms could be
26morphological characteristics of the fossils. Or the documents could be tunes,
27and the terms could then be phrases of notes that occur in the tunes.
28<P>
29If, in an IR system, a document, D, is described by a term, t, t is said to
30<EM>index</EM> D, and we can write,
31<PRE>
32            t -&gt; D
33</PRE>
34In fact an IR system consists of a set of documents, D<sub>1</sub>, D<sub>2</sub>, D<sub>3</sub> ..., a set
35of terms t<sub>1</sub>, t<sub>2</sub>, t<sub>3</sub> ..., and set of relationships,
36<PRE>
37            t<sub>i</sub> -&gt; D<sub>j</sub>
38</PRE>
39i.e. instances of terms indexing documents. A single instance of a particular
40term indexing a particular document is called a <EM>posting</EM>.
41<P>
42For a document, D, there is a list of terms which index it. This is called the
43<EM>term list</EM> of D.
44<P>
45
46For a term, t, there is a list of documents which it indexes. This is called
47the <EM>posting list</EM> of t. (`Document list' would be more consistent, but sounds
48a little too vague for this very important concept.)
49<P>
50
51At a simple level a computerised IR system puts the terms in an
52<EM>index</EM> file. A term can be efficiently looked up and its posting list
53found.
54In the posting list, each document is represented by a short identifier.  To
55over-simplify a little, a posting list can be thought
56of as a list of numbers (document ids), and term list as a list of strings
57(the terms).  Some systems represent each term by a number internally, so the
58term list is then also a list of numbers.  Xapian doesn't - it uses the terms
59themselves, and uses prefix compression to store them compactly.
60
61<P>
62The terms needn't be (and often aren't) just the words from the document.
63Usually they are converted to lower case, and often a stemming algorithm is
64applied, so a single term
65`connect' might derive from a number of words, `connect', `connects',
66`connection', `connected' and so on. A single word might also give rise to more
67than one term, for example you might index both stemmed and unstemmed forms
68of some or all terms.  Or a stemming algorithm could conceivably produce
69more than one stem in some cases (this isn't the case for any of the stemming
70algorithms Xapian currently supports, but consider the
71<a href="http://en.wikipedia.org/wiki/Double_Metaphone">Double Metaphone</a>
72phonetic algorithm which can produce two codes from a single input).
73
74<H2>Xapian's context within IR</H2>
75
76In the beginning IR was dominated by Boolean retrieval, described in the next
77section. This could be called the antediluvian period, or generation zero.
78The first generation of IR research dates from the early sixties, and was
79dominated by model building, experimentation, and heuristics. The big names
80were <a href="http://en.wikipedia.org/wiki/Gerard_Salton">Gerard Salton</a>
81and
82<a href="http://en.wikipedia.org/wiki/Karen_Sparck_Jones">Karen Sparck Jones</a>.
83The second period, which began in
84the mid-seventies, saw a big shift towards mathematics, and a rise
85of the IR model based upon probability theory - probabilistic IR. The big
86name here was, and continues to be,
87<a href="http://www.soi.city.ac.uk/~ser/homepage.html">Stephen Robertson</a>.
88More recently
89<a href="http://en.wikipedia.org/wiki/C._J._van_Rijsbergen">Keith van Rijsbergen</a>
90has led a group that has developed underlying logical models
91of IR, but interesting as this new work is, it has not as yet led to results
92that offer improvements for the IR system builder.
93
94<P>
95Xapian was built as a system for efficiently implementing
96the probabilistic IR model (though this doesn't mean it is limited to
97only implementing this model - other models can be implemented providing
98they can be expressed in a suitable way).  Xapian tries to implement the
99probabilistic model faithfully, though in
100some places it can be told to use short-cuts for efficiency.
101
102<P>
103The model has two striking advantages:
104
105<OL><LI>
106It leads to systems that give good retrieval performance. As the model has
107developed over the last 25 years, this has proved so consistently true that
108one is led to suspect that the probability theory model is, in some sense, the
109`correct' model for IR. The IR process would appear to function as the model
110suggests.
111</LI><LI>
112As new problems come up in IR, the probabilistic model can usually suggest
113a solution. This makes it a very practical mental tool for cutting through the
114jungle of possibilities when designing IR software.
115</LI></OL>
116
117In simple cases the model reduces to simple formulae in general use, so don't
118be alarmed by the apparent complexity of the equations below.
119We need them for a full understanding of the general case.
120
121<H2>Boolean retrieval</H2>
122
123A Boolean construct of terms retrieves a corresponding set of documents. So,
124if:
125<PRE>
126     t<sub>1</sub>  indexes documents  1 2 3 5 8
127     t<sub>2</sub>  indexes documents  2 3 6
128</PRE>
129then
130<PRE>
131     t<sub>1</sub> AND t<sub>2</sub>      retrieves  2 3
132     t<sub>1</sub> OR t<sub>2</sub>       retrieves  1 2 3 5 6 8
133     t<sub>1</sub> AND_NOT t<sub>2</sub>  retrieves  1 5 8
134     t<sub>2</sub> AND_NOT t<sub>1</sub>  retrieves  6
135</PRE>
136
137The posting list of a term is a set of documents. IR
138becomes a matter of constructing other sets by doing unions, intersections
139and differences on posting lists.
140<P>
141For example, in an IR system of works of literature, a Boolean query
142<PRE>
143    (lang:en OR lang:fr OR lang:de) AND (type:novel OR type:play) AND century:19
144</PRE>
145might be used to retrieve all English, French
146or German novels or plays of the 19th century.
147
148<P>
149Boolean retrieval is often useful, but is rather inadequate on its
150own as a general IR tool.  Results aren't ordered by any measure of how
151"good" they might be, and users require training to make effective use
152of such a system.  Despite this, purely boolean IR systems continue to
153survive.
154
155<P>
156By default, Xapian uses probabilistic ranking to order retrieved documents
157while allowing Boolean expressions of arbitrary complexity (some boolean IR
158systems are restricted to queries in normal form) to limit those
159documents retrieved, which provides the benefits of both approaches.
160Pure Boolean retrieval is also supported (select the <a
161href="apidoc/html/classXapian_1_1BoolWeight.html">BoolWeight</a> weighting
162scheme using <code>enquire.set_weighting_scheme(Xapian::BoolWeight());</code>).
163
164<H2>Relevance and the idea of a query</H2>
165
166<EM>Relevance</EM> is a central concept to the probabilistic model.
167Whole academic papers have been devoted to discussing the nature of relevance
168but essentially a document is relevant if it was what the user really wanted!
169Retrieval is rarely perfect, so among documents retrieved there
170will be non-relevant ones; among those not retrieved, relevant ones.
171
172<p>Relevance is modelled as a black or white attribute.  There are no degrees
173of relevance, a document either is, or
174is not, relevant. In the probabilistic model there is however a probability
175of relevance, and documents of low probability of relevance
176in the model generally correspond to documents that, in practice, one would describe
177as having low relevance.
178
179<P>
180What the user actually wants has to be expressed in some form, and the
181expression of the user's need is the query. In the probabilistic model the
182query is, usually, a list of terms, but that is the end process of a chain of
183events. The user has a need; this is expressed in ordinary language; this is
184then turned into a written form that the user judges will yield good results
185in an IR system, and the IR system then turns this form into a set, <I>Q</I>, of
186terms for processing the query. Relevance must be judged against the user's
187original need, not against a later interpretation of what <I>Q</I>, the set of
188terms, ought to mean.
189
190<P>
191Below, a query is taken to be just a set of terms, but it is important to
192realise that this is a simplification. Each link in the chain that takes us
193from the <em>information need</em> ("what the user is looking for")
194to the abstraction in <I>Q</I> is liable to error, and
195these errors compound to affect IR performance. In fact the performance of IR
196systems as a whole is much worse than most people generally imagine.
197
198<H2>Evaluating IR performance</H2>
199
200It is possible to set up a test to evaluate an IR system. Suppose <I>Q</I> is a
201query, and out of the complete collection of documents in the IR system, a set
202of documents <I>R</I> of size R are relevant to the query. So if a document is in
203<I>R</I> it is relevant, and if not in <I>R</I> it is non-relevant. Suppose the IR system
204is able to give us back K documents, among which r are relevant. <EM>Precision</EM>
205and <EM>recall</EM> are defined as being,
206
207<blockquote>
208<table border=0><tr valign=center>
209<td><tt>precision =&nbsp;</tt></td>
210<td>
211<tt><center>
212<u>r</u><br>K</center></tt>
213</td>
214<td><tt>,&nbsp;&nbsp;&nbsp;recall =&nbsp;</tt></td>
215<td>
216<tt><center>
217<u>r</u><br>R</center></tt>
218</td>
219</tr></table>
220</blockquote>
221
222Precision is the density of relevant documents among those retrieved. Recall
223is the proportion of relevant documents retrieved. In most IR systems K is a
224parameter that can be varied, and what you find is that when K is low you get
225high precision at the expense of low recall, and when K is high you get high
226recall at the expense of low precision.
227<p>
228The ideal value of K will depend on the use of the system.  For example, if a
229user wants the answer to a simple question and the system contains many
230documents which would answer it, a low value of K will be best to give a
231small number of relevant results.  But in a system indexing legal cases,
232users will often wish to make sure no potentially relevant case is missed
233even if that requires they check more non-relevant cases, so a high value of
234K will be best.
235<p>
236Retrieval effectiveness is often shown as a graph of precision against recall
237average over a number of queries, and plotted for different values of K.
238Such curves typically have a shape similar to a hyperbola (y=1/x).
239<P>
240A collection like this, consisting of a set of documents, a set of queries,
241and for each query, a complete set of relevance assessments, is called a
242<EM>test collection</EM>. With a test collection you can test out different IR
243ideas, and see how well one performs against another. The controversial part
244of establishing any test collection is the procedure employed for determining
245the sets <I>R</I><sub>i</sub>, of relevance assessments. Subjectivity of judgement comes in
246here, and people will differ about whether a particular document is relevant
247to a particular query. Even so, the averaging across queries reduces the
248errors that may occasionally arise through faulty relevance judgements, and
249averaging important tests across a number of test collections reduces the
250effects caused by accidental features of individual collections, and the
251results obtained by these tests in modern research are generally accepted as
252trustworthy.  Nowadays such research with test collections is organised from
253<A HREF="http://trec.nist.gov/">TREC</A>.
254
255<H2>Probabilistic term weights</H2>
256
257In this section we will try to present some of the thinking behind the
258formulae. This is really to give a feel for where the probabilistic model
259comes from. You may want to skim through this section if you're not too interested.
260<P>
261Suppose we have an IR system with a total of N documents. And suppose <I>Q</I> is a
262query in this IR system, made up of terms t<sub>1</sub>, t<sub>2</sub> ... t<sub>Q</sub>. There is a set,
263<I>R</I>, of documents relevant to the query.
264<P>
265In 1976, Stephen Robertson derived a formula which gives an ideal numeric
266weight to a term t of Q. Just how this weight gets used we will see below,
267but essentially a high weight means an important term and a low weight means
268an unimportant term. The formula is,
269
270<blockquote>
271<table border=0><tr valign=center>
272<td><tt>w(t) = log&nbsp;</tt></td>
273<td>
274<font size="+2">(</font>
275</td>
276<td>
277<tt><center>
278<u>p (1 - q)</u><br>(1 - p) q</center></tt>
279</td>
280<td>
281<font size="+2">)</font>
282</td>
283</tr></table>
284</blockquote>
285
286(The base of the logarithm doesn't matter, but we can suppose it is e.) p is
287the probability that t indexes a relevant document, and q the probability
288that t indexes a non-relevant document. And of course, 1 - p is the
289probability that t does not index a relevant document, and 1 - q the
290probability that t does not index a non-relevant document. More mathematically,
291<blockquote><PRE>
292p = P(t -&gt; D | D in <I>R</I>)
293q = P(t -&gt; D | D not in <I>R</I>)
294
2951 - p = P(t not -&gt; D | D in <I>R</I>)
2961 - q = P(t not -&gt; D | D not in <I>R</I>)
297</PRE></blockquote>
298
299Suppose that t indexes n of the N documents in the IR system. As before, we suppose also
300that there are R documents in <I>R</I>, and that there are r documents in <I>R</I> which
301are indexed by t.
302<P>
303p is easily estimated by r/R, the ratio of the number of relevant documents
304indexed by t to the total number of relevant documents.
305</P>
306The total number of non-relevant documents is N - R, and the number of those
307indexed by t is n - r, so we can estimate q as (n - r)/(N - R). This gives us
308the estimates,
309<blockquote>
310<table border=0><tr valign=center>
311<td><tt>&nbsp;&nbsp;&nbsp;&nbsp;p =&nbsp;</tt></td>
312<td>
313<tt><center>
314<u>r</u><br>R</center></tt>
315</td>
316<td><tt>,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1 - q =&nbsp;</tt></td>
317<td>
318<tt><center>
319<u>N - R - n + r</u><br>N - R</center></tt>
320</td>
321</tr></table>
322<table border=0><tr valign=center>
323<td><tt>1 - p =&nbsp;</tt></td>
324<td>
325<tt><center>
326<u>R - r</u><br>R</center></tt>
327</td>
328<td><tt>,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;q =&nbsp;</tt></td>
329<td>
330<tt><center>
331<u>n - r</u><br>N - R</center></tt>
332</td>
333</tr></table>
334</blockquote>
335
336and so substituting in the formula for w(t) we get the estimate,
337
338<blockquote>
339<table border=0><tr valign=center>
340<td>
341<tt>w(t) = log&nbsp;</tt>
342</td>
343<td>
344<font size="+2">(</font>
345</td>
346<td>
347<tt><center>
348<u>r (N - R - n + r)</u><br>(R - r)(n - r)</center></tt>
349</td>
350<td>
351<font size="+2">)</font>
352</td>
353</tr></table>
354</blockquote>
355
356Unfortunately, this formula is subject to violent behaviour when, say, n = r
357(infinity) or r = 0 (minus infinity), and so Robertson suggests the modified
358form
359<blockquote>
360<table border=0><tr valign=center>
361<td>
362<tt>w(t) = log&nbsp;</tt>
363</td>
364<td>
365<font size="+2">(</font>
366</td>
367<td>
368<tt><center>
369<u>(r + &frac12;) (N - R - n + r + &frac12;)</u><br>(R - r + &frac12;) (n - r + &frac12;)</center></tt>
370</td>
371<td>
372<font size="+2">)</font>
373</td>
374</tr></table>
375</blockquote>
376
377with the reassurance that this has "some theoretical justification". This is
378the form of the term weighting formula used in Xapian's BM25Weight.
379<P>
380Note that n is dependent on the term, t, and R on the query, <I>Q</I>, while r
381depends both on t and <I>Q</I>. N is constant, at least until the IR system changes.
382<P>
383At first sight this formula may appear to be quite useless. After all, <I>R</I> is
384what we are trying to find. We can't evaluate w(t) until we have <I>R</I>, and if
385we have <I>R</I> the retrieval process is over, and term weights are no longer
386of any interest to us.
387<P>
388But the point is we can estimate p and q from a subset of <I>R</I>. As soon as some
389records are found relevant by the user they can be used as a working set for
390<I>R</I> from which the weights w(t) can be derived, and these new weights can be
391used to improve the processing of the query.
392<P>
393In fact in the Xapian software <I>R</I> tends to mean not the complete set of
394relevant documents, which indeed can rarely be discovered, but a small set of
395documents which have been judged as relevant.
396<P>
397Suppose we have no documents marked as relevant. Then R = r = 0, and w(t)
398becomes,
399
400<blockquote>
401<table border=0><tr valign=center>
402<td><tt>log&nbsp;</tt></td>
403<td>
404<font size="+2">(</font>
405</td>
406<td>
407<tt><center>
408<u>N - n + &frac12;</u><br>n + &frac12;</center></tt>
409</td>
410<td>
411<font size="+2">)</font>
412</td>
413</tr></table>
414</blockquote>
415
416This is approximately log((N - n)/n). Or log(N/n), since n is usually small
417compared with N. This is called inverse logarithmic weighting, and has been
418used in IR for many decades, quite independently of the probabilistic theory
419which underpins it. Weights of this form are in fact the starting point in
420Xapian when no relevance information is present.
421<P>
422The number n incidentally is often called the <EM>frequency</EM> of a term.
423We prefer
424the phrase <EM>term frequency</EM>, to better distinguish it from wdf and wqf
425introduced below.
426<P>
427In extreme cases w(t) can be negative. In Xapian, negative values are
428disallowed, and simply replaced by a small positive value.
429
430
431<H2>wdp, wdf, ndl and wqf</H2>
432
433Before we see how the weights are used there are a few more ideas to introduce.
434<P>
435As mentioned before, a term t is said to index a document D, or t -&gt; D. We have emphasised that D
436may not be a piece of text in machine-readable form, and that, even when it
437is, t may not actually occur in the text of D. Nevertheless, it will often be
438the case that D is made up of a list of words,
439<PRE>
440        D = w<sub>1</sub>, w<sub>2</sub>, w<sub>3</sub> ... w<sub>m</sub>
441</PRE>
442
443and that many, if not all, of the terms which index D derive from these words
444(for example, the terms are often lower-cased and stemmed forms of these words).
445
446<P>
447If a term derives from words w<sub>9</sub>, w<sub>38</sub>, w<sub>97</sub> and
448w<sub>221</sub> in the indexing process, we can say that the term `occurs' in D at
449positions 9, 38, 97 and 221, and so for each term a document may have a
450vector of positional information. These are the <EM>within-document positions</EM> of
451t, or the <EM>wdp</EM> information of t.
452<P>
453The <EM>within-document frequency</EM>, or <EM>wdf</EM>, of a term t in D is the number of
454times it is pulled out of D in the indexing process. Usually this is the size
455of the wdp vector, but in Xapian it can exceed it, since we can
456apply extra wdf to some parts of the document text.  For example,
457often this is done for the document title and abstract to
458attach extra importance to their contents compared to the
459rest of the document text.
460<P>
461There are various ways in which we might measure the length of a document, but
462the easiest is to suppose it is made up of m words, w<sub>1</sub> to w<sub>m</sub>, and to define
463its length as m.
464<P>
465The <EM>normalised document length</EM>, or <EM>ndl</EM>, is then m divided by the average
466length of the documents in the IR system. So the average length document has
467ndl equal to 1, short documents are less than 1, long documents greater than
4681. We have found that very small ndl values create problems, so Xapian
469actually allows for a non-zero minimum value for the ndl.
470
471<P>
472In the probabilistic model the query, <I>Q</I>, is itself very much like another
473document. Frequently indeed <I>Q</I> will be created from a document, either one
474already in the IR system, or by an indexing process very similar to the one
475used to add documents into the whole IR system. This corresponds to a user
476saying "give me other documents like this one". One can therefore attach a
477similar meaning to within-query position information, within-query frequency,
478and normalised query length, or wqp, wqf and nql. Xapian does not currently
479use the concept of wqp.
480
481
482<H2>Using the weights. The <I>MSet</I></H2>
483
484Now to pull everything together. From the probabilistic term weights we can
485assign a weight to any document, d, as follows,
486
487<blockquote>
488<table border=0><tr valign=center>
489<td><tt>W(d) =&nbsp;</tt></td>
490<td>
491<tt><center>
492<font size="+4">&Sigma;</font><br><small>t -&gt; d, t in <i>Q</i></small></tt>
493</td>
494<td><tt><center>
495<u>(k + 1) f</u><sub>t</sub><br>k.L<sub>d</sub> + f<sub>t</sub>
496</center></tt></td>
497<td><tt>&nbsp;w(t)</tt></td>
498</tr></table>
499</blockquote>
500
501The sum extends over the terms of <I>Q</I> which index d. f<sub>t</sub> is the wdf of t in d,
502L<sub>d</sub> is the ndl of d, and k is some suitably chosen constant.
503<P>
504The factor k+1 is actually redundant, but helps with the interpretation of
505the equation.  In Xapian, this weighting scheme is implemented by the
506<a href="apidoc/html/classXapian_1_1TradWeight.html">Xapian::TradWeight class</a>
507and the factor (k+1) is ignored.
508
509<P>
510If k is set to zero the factor before w(t) is 1, and the wdfs
511are ignored. As k tends to infinity, the factor becomes f<sub>t</sub>/L<sub>d</sub>, and the wdfs
512take on their greatest importance. Intermediate values scale the wdf
513contribution between these extremes.
514The best k actually depends on the characteristics of the IR system as a
515whole, and unfortunately no rule can be given for choosing it.
516By default, Xapian sets k to 1 which should give reasonable results
517for most systems.  W(d) is merely tweaked a bit by the
518wdf values, and users observe a simple pattern of retrieval.
519It is possible to tune k to provide optimal results for a
520specific system.
521<P>
522Any d in the IR system has a value W(d), but, if no term of the query indexes
523d, W(d) will be zero. In practice only documents for which W(d) &gt; 0 will be of
524interest, and these are the documents indexed by at least one term of <I>Q</I>. If
525we now take these documents and arrange them by decreasing W(d) value, we get
526a ranked list called the <EM>match set</EM>, or <EM>MSet</EM>, of document and weight
527pairs:
528<PRE>
529     MSet:        item 0:   D<sub>0</sub>  W(D<sub>0</sub>)
530                  item 1:   D<sub>1</sub>  W(D<sub>1</sub>)
531                  item 2:   D<sub>2</sub>  W(D<sub>2</sub>)
532                        ....
533                  item K:   D<sub>K</sub>  W(D<sub>K</sub>)
534</PRE>
535
536where W(D<sub>j</sub>) &gt;= W(D<sub>i</sub>) if j &gt; i.
537<P>
538And according to the probabilistic model, the documents D<sub>0</sub>, D<sub>1</sub>, D<sub>2</sub> ... are
539ranked by decreasing order of probability of relevance. So D<sub>0</sub> has highest
540probability of being relevant, then D<sub>1</sub> and so on.
541<P>
542Xapian creates the MSet from the posting lists of the terms
543of the query. This is the central operation of any IR system, and will be
544familiar to anyone who has used one of the Internet's major search engines,
545where the query is what you type in the query box, and the resulting hit list
546corresponds to the top few items of the MSet.
547<P>
548
549The cutoff point, K, is chosen when the MSet is created. The candidates for
550inclusion in the MSet are all documents indexed by at least one term of <I>Q</I>,
551and their number will usually exceed the choice of K (K is typically set to
552be 1000 or less). So the MSet is actually the best K documents found in the
553match process.
554<P>
555
556A modification of this weighting scheme can be employed that takes into
557account the query itself:
558
559<blockquote>
560<table border=0><tr valign=center>
561<td><tt>W(d) =&nbsp;</tt></td>
562<td>
563<tt><center>
564<font size="+4">&Sigma;</font><br><small>t -&gt; d, t in <i>Q</i></small></center></tt>
565</td>
566<td><tt><center>
567<u>(k<sub>3</sub> + 1) q</u><sub>t</sub><br>k<sub>3</sub>L' + q<sub>t</sub>
568</center></tt></td>
569<td><tt>&nbsp;</tt></td>
570<td><tt><center>
571<u>(k + 1) f</u><sub>t</sub><br>kL<sub>d</sub> + f<sub>t</sub>
572</center></tt></td>
573<td><tt>&nbsp;w(t)</tt></td>
574</tr></table>
575</blockquote>
576
577where q<sub>t</sub> is the wqf of t in <I>Q</I>, L' is the nql, or normalised query length,
578and k<sub>3</sub> is a further constant. In computing W(d) across the document space,
579this extra factor may be viewed as just a modification to the basic term
580weights, w(t). Like k and k<sub>3</sub>, we will need to make an inspired guess for L'.
581In fact the choices for k<sub>3</sub> and L' will depend on the broader context of the
582use of this formula, and more advice will be given as occasion arises.
583
584<p>
585Xapian's default weighting scheme is a generalised form of this weighting
586scheme modification, known as <a href="bm25.html">BM25</a>.  In BM25, L' is
587always set to 1.
588
589<H2>Using the weights: the <I>ESet</I></H2>
590
591But as well as ranking documents, Xapian can rank terms, and this is most
592important.
593The higher up the ranking the term is, the more likely it is
594to act as a good differentiator between relevant and non-relevant documents.
595It is therefore a candidate for adding back into the query. Terms from this
596list can therefore be used to expand the size of the query, after which the
597query can be re-run to get a better MSet. Because this list of terms is
598mainly used for query expansion, it is called the <EM>expand set</EM> or <EM>ESet</EM>.
599
600<P>
601The term expansion weighting formula is as follows,
602
603<blockquote>
604<PRE>
605W(t) = r w(t)
606</PRE>
607</blockquote>
608
609in other words we multiply the term weight by the number of
610relevant documents that have been indexed by the term.
611<P>
612The ESet then has this form,
613
614<PRE>
615     ESet:        item 0:   t<sub>0</sub>  W(t<sub>0</sub>)
616                  item 1:   t<sub>1</sub>  W(t<sub>1</sub>)
617                  item 2:   t<sub>2</sub>  W(t<sub>2</sub>)
618                        ....
619                  item K:   t<sub>K</sub>  W(t<sub>K</sub>)
620</PRE>
621
622where W(t<sub>j</sub>) &gt;= W(t<sub>i</sub>) if j &gt; i.
623<P>
624Since the main function of the ESet is to find new terms to be added to <I>Q</I>,
625we usually omit from it terms already in <I>Q</I>.
626<P>
627The W(t) weight is applicable to any term in the IR system, but has a value
628zero when t does not index a relevant document. The ESet is therefore
629confined to be a ranking of the best K terms which index relevant documents.
630<P>
631This simple form of W(t) is traditional in the probabilistic model, but
632seems less than optimal because it does not take into account wdf information.
633One can if fact try to generalise it to
634
635<blockquote>
636<table border=0><tr valign=center>
637<td><tt>W(t) =&nbsp;</tt></td>
638<td>
639<tt><center>
640<font size="+4">&Sigma;</font><br><small>t -&gt; d, d in <i>R</i></small></tt>
641</td>
642<td><tt><center>
643<u>(k + 1) f</u><sub>t</sub><br>kL + f<sub>t</sub>
644</center></tt></td>
645<td><tt>&nbsp;w(t)</tt></td>
646</tr></table>
647</blockquote>
648
649k is again a constant, but it does not need to have the same
650value as the k used in the probabilistic term weights above.
651<!-- FIXME: Might it have the same value?  If not, better to rename it?
652Is it be sensible to set it the same as the other k if you change that? -->
653In Xapian, k defaults to 1.0 for ESet generation.
654<P>
655This reduces to W(t) = r w(t) when k = 0. Certainly this form can be
656recommended in the very common case where r = 1, that is, we have a single
657document marked relevant.
658
659<H2>The progress of a query</H2>
660
661<p>
662Below we describe the general case of the IR model supported, including
663use of a relevance set (<a href="glossary.html#rset">RSet</a>),
664query expansion, improved term weights and reranking.
665You don't have to use any of these for Xapian to be useful, but they are
666available should you need them.
667</p>
668
669<p>
670The user enters a query. This is parsed into a form the IR system understands,
671and run by the IR system, which returns two lists, a
672list of captions, derived from the MSet, and a list of terms, from the ESet.
673If the RSet is empty, the first few documents of the MSet can be
674used as a stand-in - after all, they have a good chance of being relevant! You
675can read a document by clicking on the caption. (We assume the usual
676screen/mouse environment.) But you can also mark a document as relevant
677(change <I>R</I>) or cause a term to be added from the ESet to the query (change
678<I>Q</I>). As soon as any change is made to the query environment the query can be
679rerun, although you might have a front-end where nothing happens until you
680click on some "Run Query" button.
681</p>
682
683<p>
684In any case rerunning the query leads to a new MSet and ESet, and so to a
685new display. The IR process is then an iterative one. You can delete terms
686from the query or add them in; mark or unmark documents as being relevant.
687Eventually you converge on the answer to the query, or at least, the best
688answer the IR system can give you.
689</p>
690
691<H2>Further Reading</H2>
692
693<P>If you want to find out more, then
694<a href="http://citeseer.ist.psu.edu/robertson97simple.html"><i>"Simple,
695proven approaches to text retrieval"</i></a> is a worthwhile read.  It's a
696good introduction to Probabilistic Information retrieval,
697which is basically what Xapian provides.
698
699<P>
700There are also several good books on the subject of Information retrieval.
701
702<UL>
703<LI>&quot;<em>Information Retrieval</em>&quot; by C. J. van
704Rijsbergen is well worth reading.  It's out of print, but is available
705for free <A
706HREF="http://www.dcs.gla.ac.uk/Keith/Preface.html">from the author's website</A>
707(in HTML or PDF).
708
709<LI> &quot;<em>Readings in Information Retrieval</em>&quot; (published by
710Morgan Kaufmann, edited by Karen Sparck Jones and Peter Willett) is a
711collection of published papers covering many aspects of the subject.
712
713<LI> &quot;<em>Managing Gigabytes</em>&quot; (also published by
714Morgan Kaufmann, written by Ian H. Witten, Alistair Moffat and
715Timothy C. Bell) describes information retrieval and compression techniques.
716
717<LI> &quot;<em>Modern Information Retrieval</em>&quot; (published by
718Addison Wesley, written by Ricardo Baeza-Yates and Berthier Ribeiro-Neto)
719gives a good overview of the field.  It was published more recently than
720the books above, and so covers some more recent developments.
721
722<li> &quot;<em>Introduction to Information Retrieval</em>&quot; (<b>to
723be published in 2008</b> by Cambridge University Press, written by
724Christopher D. Manning, Prabhakar Raghavan and Hinrich Schi&uuml;tze)
725looks to be a good introductory work (we've not read it in detail yet).
726It's not yet published, but the latest draft can be read online at
727<a href="http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html">the
728book's companion website</a> which says the online version will remain
729after publication.
730
731</UL>
732
733<!-- FOOTER $Author$ $Date$ $Id$ -->
734</BODY>
735</HTML>
Note: See TracBrowser for help on using the browser.