| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>Xapian: Theoretical Background</TITLE> |
|---|
| 5 | </HEAD> |
|---|
| 6 | <BODY BGCOLOR="white"> |
|---|
| 7 | <H1>Theoretical Background</H1> |
|---|
| 8 | |
|---|
| 9 | <p>This document aims to provide some theoretical background to Xapian.</p> |
|---|
| 10 | |
|---|
| 11 | <H2>Documents and terms</H2> |
|---|
| 12 | |
|---|
| 13 | In Information Retrieval (IR), the items we are trying to retrieve are called |
|---|
| 14 | <EM>documents</EM>, and each document is described by a collection of <EM>terms</EM>. These |
|---|
| 15 | two words, `document' and `term', are now traditional in the vocabulary of IR, |
|---|
| 16 | and reflect its Library Science origins. Usually a document is thought |
|---|
| 17 | of as a piece of text, most likely in a machine readable form, and a term as |
|---|
| 18 | a word or phrase which helps to describe the document, and which may indeed |
|---|
| 19 | occur one or more times in the document. So a document might be about |
|---|
| 20 | dental care, and could be described by corresponding terms `tooth', `teeth', |
|---|
| 21 | `toothbrush', `decay', `cavity', `plaque', `diet' and so on. |
|---|
| 22 | <P> |
|---|
| 23 | More generally a document can be anything we want to retrieve, and a term any |
|---|
| 24 | feature that helps describe the documents. So the documents could be a |
|---|
| 25 | collection of fossils held in a big museum collection, and the terms could be |
|---|
| 26 | morphological characteristics of the fossils. Or the documents could be tunes, |
|---|
| 27 | and the terms could then be phrases of notes that occur in the tunes. |
|---|
| 28 | <P> |
|---|
| 29 | If, in an IR system, a document, D, is described by a term, t, t is said to |
|---|
| 30 | <EM>index</EM> D, and we can write, |
|---|
| 31 | <PRE> |
|---|
| 32 | t -> D |
|---|
| 33 | </PRE> |
|---|
| 34 | In fact an IR system consists of a set of documents, D<sub>1</sub>, D<sub>2</sub>, D<sub>3</sub> ..., a set |
|---|
| 35 | of terms t<sub>1</sub>, t<sub>2</sub>, t<sub>3</sub> ..., and set of relationships, |
|---|
| 36 | <PRE> |
|---|
| 37 | t<sub>i</sub> -> D<sub>j</sub> |
|---|
| 38 | </PRE> |
|---|
| 39 | i.e. instances of terms indexing documents. A single instance of a particular |
|---|
| 40 | term indexing a particular document is called a <EM>posting</EM>. |
|---|
| 41 | <P> |
|---|
| 42 | For a document, D, there is a list of terms which index it. This is called the |
|---|
| 43 | <EM>term list</EM> of D. |
|---|
| 44 | <P> |
|---|
| 45 | |
|---|
| 46 | For a term, t, there is a list of documents which it indexes. This is called |
|---|
| 47 | the <EM>posting list</EM> of t. (`Document list' would be more consistent, but sounds |
|---|
| 48 | a little too vague for this very important concept.) |
|---|
| 49 | <P> |
|---|
| 50 | |
|---|
| 51 | At a simple level a computerised IR system puts the terms in an |
|---|
| 52 | <EM>index</EM> file. A term can be efficiently looked up and its posting list |
|---|
| 53 | found. |
|---|
| 54 | In the posting list, each document is represented by a short identifier. To |
|---|
| 55 | over-simplify a little, a posting list can be thought |
|---|
| 56 | of as a list of numbers (document ids), and term list as a list of strings |
|---|
| 57 | (the terms). Some systems represent each term by a number internally, so the |
|---|
| 58 | term list is then also a list of numbers. Xapian doesn't - it uses the terms |
|---|
| 59 | themselves, and uses prefix compression to store them compactly. |
|---|
| 60 | |
|---|
| 61 | <P> |
|---|
| 62 | The terms needn't be (and often aren't) just the words from the document. |
|---|
| 63 | Usually they are converted to lower case, and often a stemming algorithm is |
|---|
| 64 | applied, so a single term |
|---|
| 65 | `connect' might derive from a number of words, `connect', `connects', |
|---|
| 66 | `connection', `connected' and so on. A single word might also give rise to more |
|---|
| 67 | than one term, for example you might index both stemmed and unstemmed forms |
|---|
| 68 | of some or all terms. Or a stemming algorithm could conceivably produce |
|---|
| 69 | more than one stem in some cases (this isn't the case for any of the stemming |
|---|
| 70 | algorithms Xapian currently supports, but consider the |
|---|
| 71 | <a href="http://en.wikipedia.org/wiki/Double_Metaphone">Double Metaphone</a> |
|---|
| 72 | phonetic algorithm which can produce two codes from a single input). |
|---|
| 73 | |
|---|
| 74 | <H2>Xapian's context within IR</H2> |
|---|
| 75 | |
|---|
| 76 | In the beginning IR was dominated by Boolean retrieval, described in the next |
|---|
| 77 | section. This could be called the antediluvian period, or generation zero. |
|---|
| 78 | The first generation of IR research dates from the early sixties, and was |
|---|
| 79 | dominated by model building, experimentation, and heuristics. The big names |
|---|
| 80 | were <a href="http://en.wikipedia.org/wiki/Gerard_Salton">Gerard Salton</a> |
|---|
| 81 | and |
|---|
| 82 | <a href="http://en.wikipedia.org/wiki/Karen_Sparck_Jones">Karen Sparck Jones</a>. |
|---|
| 83 | The second period, which began in |
|---|
| 84 | the mid-seventies, saw a big shift towards mathematics, and a rise |
|---|
| 85 | of the IR model based upon probability theory - probabilistic IR. The big |
|---|
| 86 | name here was, and continues to be, |
|---|
| 87 | <a href="http://www.soi.city.ac.uk/~ser/homepage.html">Stephen Robertson</a>. |
|---|
| 88 | More recently |
|---|
| 89 | <a href="http://en.wikipedia.org/wiki/C._J._van_Rijsbergen">Keith van Rijsbergen</a> |
|---|
| 90 | has led a group that has developed underlying logical models |
|---|
| 91 | of IR, but interesting as this new work is, it has not as yet led to results |
|---|
| 92 | that offer improvements for the IR system builder. |
|---|
| 93 | |
|---|
| 94 | <P> |
|---|
| 95 | Xapian was built as a system for efficiently implementing |
|---|
| 96 | the probabilistic IR model (though this doesn't mean it is limited to |
|---|
| 97 | only implementing this model - other models can be implemented providing |
|---|
| 98 | they can be expressed in a suitable way). Xapian tries to implement the |
|---|
| 99 | probabilistic model faithfully, though in |
|---|
| 100 | some places it can be told to use short-cuts for efficiency. |
|---|
| 101 | |
|---|
| 102 | <P> |
|---|
| 103 | The model has two striking advantages: |
|---|
| 104 | |
|---|
| 105 | <OL><LI> |
|---|
| 106 | It leads to systems that give good retrieval performance. As the model has |
|---|
| 107 | developed over the last 25 years, this has proved so consistently true that |
|---|
| 108 | one is led to suspect that the probability theory model is, in some sense, the |
|---|
| 109 | `correct' model for IR. The IR process would appear to function as the model |
|---|
| 110 | suggests. |
|---|
| 111 | </LI><LI> |
|---|
| 112 | As new problems come up in IR, the probabilistic model can usually suggest |
|---|
| 113 | a solution. This makes it a very practical mental tool for cutting through the |
|---|
| 114 | jungle of possibilities when designing IR software. |
|---|
| 115 | </LI></OL> |
|---|
| 116 | |
|---|
| 117 | In simple cases the model reduces to simple formulae in general use, so don't |
|---|
| 118 | be alarmed by the apparent complexity of the equations below. |
|---|
| 119 | We need them for a full understanding of the general case. |
|---|
| 120 | |
|---|
| 121 | <H2>Boolean retrieval</H2> |
|---|
| 122 | |
|---|
| 123 | A Boolean construct of terms retrieves a corresponding set of documents. So, |
|---|
| 124 | if: |
|---|
| 125 | <PRE> |
|---|
| 126 | t<sub>1</sub> indexes documents 1 2 3 5 8 |
|---|
| 127 | t<sub>2</sub> indexes documents 2 3 6 |
|---|
| 128 | </PRE> |
|---|
| 129 | then |
|---|
| 130 | <PRE> |
|---|
| 131 | t<sub>1</sub> AND t<sub>2</sub> retrieves 2 3 |
|---|
| 132 | t<sub>1</sub> OR t<sub>2</sub> retrieves 1 2 3 5 6 8 |
|---|
| 133 | t<sub>1</sub> AND_NOT t<sub>2</sub> retrieves 1 5 8 |
|---|
| 134 | t<sub>2</sub> AND_NOT t<sub>1</sub> retrieves 6 |
|---|
| 135 | </PRE> |
|---|
| 136 | |
|---|
| 137 | The posting list of a term is a set of documents. IR |
|---|
| 138 | becomes a matter of constructing other sets by doing unions, intersections |
|---|
| 139 | and differences on posting lists. |
|---|
| 140 | <P> |
|---|
| 141 | For example, in an IR system of works of literature, a Boolean query |
|---|
| 142 | <PRE> |
|---|
| 143 | (lang:en OR lang:fr OR lang:de) AND (type:novel OR type:play) AND century:19 |
|---|
| 144 | </PRE> |
|---|
| 145 | might be used to retrieve all English, French |
|---|
| 146 | or German novels or plays of the 19th century. |
|---|
| 147 | |
|---|
| 148 | <P> |
|---|
| 149 | Boolean retrieval is often useful, but is rather inadequate on its |
|---|
| 150 | own as a general IR tool. Results aren't ordered by any measure of how |
|---|
| 151 | "good" they might be, and users require training to make effective use |
|---|
| 152 | of such a system. Despite this, purely boolean IR systems continue to |
|---|
| 153 | survive. |
|---|
| 154 | |
|---|
| 155 | <P> |
|---|
| 156 | By default, Xapian uses probabilistic ranking to order retrieved documents |
|---|
| 157 | while allowing Boolean expressions of arbitrary complexity (some boolean IR |
|---|
| 158 | systems are restricted to queries in normal form) to limit those |
|---|
| 159 | documents retrieved, which provides the benefits of both approaches. |
|---|
| 160 | Pure Boolean retrieval is also supported (select the <a |
|---|
| 161 | href="apidoc/html/classXapian_1_1BoolWeight.html">BoolWeight</a> weighting |
|---|
| 162 | scheme using <code>enquire.set_weighting_scheme(Xapian::BoolWeight());</code>). |
|---|
| 163 | |
|---|
| 164 | <H2>Relevance and the idea of a query</H2> |
|---|
| 165 | |
|---|
| 166 | <EM>Relevance</EM> is a central concept to the probabilistic model. |
|---|
| 167 | Whole academic papers have been devoted to discussing the nature of relevance |
|---|
| 168 | but essentially a document is relevant if it was what the user really wanted! |
|---|
| 169 | Retrieval is rarely perfect, so among documents retrieved there |
|---|
| 170 | will be non-relevant ones; among those not retrieved, relevant ones. |
|---|
| 171 | |
|---|
| 172 | <p>Relevance is modelled as a black or white attribute. There are no degrees |
|---|
| 173 | of relevance, a document either is, or |
|---|
| 174 | is not, relevant. In the probabilistic model there is however a probability |
|---|
| 175 | of relevance, and documents of low probability of relevance |
|---|
| 176 | in the model generally correspond to documents that, in practice, one would describe |
|---|
| 177 | as having low relevance. |
|---|
| 178 | |
|---|
| 179 | <P> |
|---|
| 180 | What the user actually wants has to be expressed in some form, and the |
|---|
| 181 | expression of the user's need is the query. In the probabilistic model the |
|---|
| 182 | query is, usually, a list of terms, but that is the end process of a chain of |
|---|
| 183 | events. The user has a need; this is expressed in ordinary language; this is |
|---|
| 184 | then turned into a written form that the user judges will yield good results |
|---|
| 185 | in an IR system, and the IR system then turns this form into a set, <I>Q</I>, of |
|---|
| 186 | terms for processing the query. Relevance must be judged against the user's |
|---|
| 187 | original need, not against a later interpretation of what <I>Q</I>, the set of |
|---|
| 188 | terms, ought to mean. |
|---|
| 189 | |
|---|
| 190 | <P> |
|---|
| 191 | Below, a query is taken to be just a set of terms, but it is important to |
|---|
| 192 | realise that this is a simplification. Each link in the chain that takes us |
|---|
| 193 | from the <em>information need</em> ("what the user is looking for") |
|---|
| 194 | to the abstraction in <I>Q</I> is liable to error, and |
|---|
| 195 | these errors compound to affect IR performance. In fact the performance of IR |
|---|
| 196 | systems as a whole is much worse than most people generally imagine. |
|---|
| 197 | |
|---|
| 198 | <H2>Evaluating IR performance</H2> |
|---|
| 199 | |
|---|
| 200 | It is possible to set up a test to evaluate an IR system. Suppose <I>Q</I> is a |
|---|
| 201 | query, and out of the complete collection of documents in the IR system, a set |
|---|
| 202 | of documents <I>R</I> of size R are relevant to the query. So if a document is in |
|---|
| 203 | <I>R</I> it is relevant, and if not in <I>R</I> it is non-relevant. Suppose the IR system |
|---|
| 204 | is able to give us back K documents, among which r are relevant. <EM>Precision</EM> |
|---|
| 205 | and <EM>recall</EM> are defined as being, |
|---|
| 206 | |
|---|
| 207 | <blockquote> |
|---|
| 208 | <table border=0><tr valign=center> |
|---|
| 209 | <td><tt>precision = </tt></td> |
|---|
| 210 | <td> |
|---|
| 211 | <tt><center> |
|---|
| 212 | <u>r</u><br>K</center></tt> |
|---|
| 213 | </td> |
|---|
| 214 | <td><tt>, recall = </tt></td> |
|---|
| 215 | <td> |
|---|
| 216 | <tt><center> |
|---|
| 217 | <u>r</u><br>R</center></tt> |
|---|
| 218 | </td> |
|---|
| 219 | </tr></table> |
|---|
| 220 | </blockquote> |
|---|
| 221 | |
|---|
| 222 | Precision is the density of relevant documents among those retrieved. Recall |
|---|
| 223 | is the proportion of relevant documents retrieved. In most IR systems K is a |
|---|
| 224 | parameter that can be varied, and what you find is that when K is low you get |
|---|
| 225 | high precision at the expense of low recall, and when K is high you get high |
|---|
| 226 | recall at the expense of low precision. |
|---|
| 227 | <p> |
|---|
| 228 | The ideal value of K will depend on the use of the system. For example, if a |
|---|
| 229 | user wants the answer to a simple question and the system contains many |
|---|
| 230 | documents which would answer it, a low value of K will be best to give a |
|---|
| 231 | small number of relevant results. But in a system indexing legal cases, |
|---|
| 232 | users will often wish to make sure no potentially relevant case is missed |
|---|
| 233 | even if that requires they check more non-relevant cases, so a high value of |
|---|
| 234 | K will be best. |
|---|
| 235 | <p> |
|---|
| 236 | Retrieval effectiveness is often shown as a graph of precision against recall |
|---|
| 237 | average over a number of queries, and plotted for different values of K. |
|---|
| 238 | Such curves typically have a shape similar to a hyperbola (y=1/x). |
|---|
| 239 | <P> |
|---|
| 240 | A collection like this, consisting of a set of documents, a set of queries, |
|---|
| 241 | and for each query, a complete set of relevance assessments, is called a |
|---|
| 242 | <EM>test collection</EM>. With a test collection you can test out different IR |
|---|
| 243 | ideas, and see how well one performs against another. The controversial part |
|---|
| 244 | of establishing any test collection is the procedure employed for determining |
|---|
| 245 | the sets <I>R</I><sub>i</sub>, of relevance assessments. Subjectivity of judgement comes in |
|---|
| 246 | here, and people will differ about whether a particular document is relevant |
|---|
| 247 | to a particular query. Even so, the averaging across queries reduces the |
|---|
| 248 | errors that may occasionally arise through faulty relevance judgements, and |
|---|
| 249 | averaging important tests across a number of test collections reduces the |
|---|
| 250 | effects caused by accidental features of individual collections, and the |
|---|
| 251 | results obtained by these tests in modern research are generally accepted as |
|---|
| 252 | trustworthy. Nowadays such research with test collections is organised from |
|---|
| 253 | <A HREF="http://trec.nist.gov/">TREC</A>. |
|---|
| 254 | |
|---|
| 255 | <H2>Probabilistic term weights</H2> |
|---|
| 256 | |
|---|
| 257 | In this section we will try to present some of the thinking behind the |
|---|
| 258 | formulae. This is really to give a feel for where the probabilistic model |
|---|
| 259 | comes from. You may want to skim through this section if you're not too interested. |
|---|
| 260 | <P> |
|---|
| 261 | Suppose we have an IR system with a total of N documents. And suppose <I>Q</I> is a |
|---|
| 262 | query in this IR system, made up of terms t<sub>1</sub>, t<sub>2</sub> ... t<sub>Q</sub>. There is a set, |
|---|
| 263 | <I>R</I>, of documents relevant to the query. |
|---|
| 264 | <P> |
|---|
| 265 | In 1976, Stephen Robertson derived a formula which gives an ideal numeric |
|---|
| 266 | weight to a term t of Q. Just how this weight gets used we will see below, |
|---|
| 267 | but essentially a high weight means an important term and a low weight means |
|---|
| 268 | an unimportant term. The formula is, |
|---|
| 269 | |
|---|
| 270 | <blockquote> |
|---|
| 271 | <table border=0><tr valign=center> |
|---|
| 272 | <td><tt>w(t) = log </tt></td> |
|---|
| 273 | <td> |
|---|
| 274 | <font size="+2">(</font> |
|---|
| 275 | </td> |
|---|
| 276 | <td> |
|---|
| 277 | <tt><center> |
|---|
| 278 | <u>p (1 - q)</u><br>(1 - p) q</center></tt> |
|---|
| 279 | </td> |
|---|
| 280 | <td> |
|---|
| 281 | <font size="+2">)</font> |
|---|
| 282 | </td> |
|---|
| 283 | </tr></table> |
|---|
| 284 | </blockquote> |
|---|
| 285 | |
|---|
| 286 | (The base of the logarithm doesn't matter, but we can suppose it is e.) p is |
|---|
| 287 | the probability that t indexes a relevant document, and q the probability |
|---|
| 288 | that t indexes a non-relevant document. And of course, 1 - p is the |
|---|
| 289 | probability that t does not index a relevant document, and 1 - q the |
|---|
| 290 | probability that t does not index a non-relevant document. More mathematically, |
|---|
| 291 | <blockquote><PRE> |
|---|
| 292 | p = P(t -> D | D in <I>R</I>) |
|---|
| 293 | q = P(t -> D | D not in <I>R</I>) |
|---|
| 294 | |
|---|
| 295 | 1 - p = P(t not -> D | D in <I>R</I>) |
|---|
| 296 | 1 - q = P(t not -> D | D not in <I>R</I>) |
|---|
| 297 | </PRE></blockquote> |
|---|
| 298 | |
|---|
| 299 | Suppose that t indexes n of the N documents in the IR system. As before, we suppose also |
|---|
| 300 | that there are R documents in <I>R</I>, and that there are r documents in <I>R</I> which |
|---|
| 301 | are indexed by t. |
|---|
| 302 | <P> |
|---|
| 303 | p is easily estimated by r/R, the ratio of the number of relevant documents |
|---|
| 304 | indexed by t to the total number of relevant documents. |
|---|
| 305 | </P> |
|---|
| 306 | The total number of non-relevant documents is N - R, and the number of those |
|---|
| 307 | indexed by t is n - r, so we can estimate q as (n - r)/(N - R). This gives us |
|---|
| 308 | the estimates, |
|---|
| 309 | <blockquote> |
|---|
| 310 | <table border=0><tr valign=center> |
|---|
| 311 | <td><tt> p = </tt></td> |
|---|
| 312 | <td> |
|---|
| 313 | <tt><center> |
|---|
| 314 | <u>r</u><br>R</center></tt> |
|---|
| 315 | </td> |
|---|
| 316 | <td><tt>, 1 - q = </tt></td> |
|---|
| 317 | <td> |
|---|
| 318 | <tt><center> |
|---|
| 319 | <u>N - R - n + r</u><br>N - R</center></tt> |
|---|
| 320 | </td> |
|---|
| 321 | </tr></table> |
|---|
| 322 | <table border=0><tr valign=center> |
|---|
| 323 | <td><tt>1 - p = </tt></td> |
|---|
| 324 | <td> |
|---|
| 325 | <tt><center> |
|---|
| 326 | <u>R - r</u><br>R</center></tt> |
|---|
| 327 | </td> |
|---|
| 328 | <td><tt>, q = </tt></td> |
|---|
| 329 | <td> |
|---|
| 330 | <tt><center> |
|---|
| 331 | <u>n - r</u><br>N - R</center></tt> |
|---|
| 332 | </td> |
|---|
| 333 | </tr></table> |
|---|
| 334 | </blockquote> |
|---|
| 335 | |
|---|
| 336 | and so substituting in the formula for w(t) we get the estimate, |
|---|
| 337 | |
|---|
| 338 | <blockquote> |
|---|
| 339 | <table border=0><tr valign=center> |
|---|
| 340 | <td> |
|---|
| 341 | <tt>w(t) = log </tt> |
|---|
| 342 | </td> |
|---|
| 343 | <td> |
|---|
| 344 | <font size="+2">(</font> |
|---|
| 345 | </td> |
|---|
| 346 | <td> |
|---|
| 347 | <tt><center> |
|---|
| 348 | <u>r (N - R - n + r)</u><br>(R - r)(n - r)</center></tt> |
|---|
| 349 | </td> |
|---|
| 350 | <td> |
|---|
| 351 | <font size="+2">)</font> |
|---|
| 352 | </td> |
|---|
| 353 | </tr></table> |
|---|
| 354 | </blockquote> |
|---|
| 355 | |
|---|
| 356 | Unfortunately, this formula is subject to violent behaviour when, say, n = r |
|---|
| 357 | (infinity) or r = 0 (minus infinity), and so Robertson suggests the modified |
|---|
| 358 | form |
|---|
| 359 | <blockquote> |
|---|
| 360 | <table border=0><tr valign=center> |
|---|
| 361 | <td> |
|---|
| 362 | <tt>w(t) = log </tt> |
|---|
| 363 | </td> |
|---|
| 364 | <td> |
|---|
| 365 | <font size="+2">(</font> |
|---|
| 366 | </td> |
|---|
| 367 | <td> |
|---|
| 368 | <tt><center> |
|---|
| 369 | <u>(r + ½) (N - R - n + r + ½)</u><br>(R - r + ½) (n - r + ½)</center></tt> |
|---|
| 370 | </td> |
|---|
| 371 | <td> |
|---|
| 372 | <font size="+2">)</font> |
|---|
| 373 | </td> |
|---|
| 374 | </tr></table> |
|---|
| 375 | </blockquote> |
|---|
| 376 | |
|---|
| 377 | with the reassurance that this has "some theoretical justification". This is |
|---|
| 378 | the form of the term weighting formula used in Xapian's BM25Weight. |
|---|
| 379 | <P> |
|---|
| 380 | Note that n is dependent on the term, t, and R on the query, <I>Q</I>, while r |
|---|
| 381 | depends both on t and <I>Q</I>. N is constant, at least until the IR system changes. |
|---|
| 382 | <P> |
|---|
| 383 | At first sight this formula may appear to be quite useless. After all, <I>R</I> is |
|---|
| 384 | what we are trying to find. We can't evaluate w(t) until we have <I>R</I>, and if |
|---|
| 385 | we have <I>R</I> the retrieval process is over, and term weights are no longer |
|---|
| 386 | of any interest to us. |
|---|
| 387 | <P> |
|---|
| 388 | But the point is we can estimate p and q from a subset of <I>R</I>. As soon as some |
|---|
| 389 | records are found relevant by the user they can be used as a working set for |
|---|
| 390 | <I>R</I> from which the weights w(t) can be derived, and these new weights can be |
|---|
| 391 | used to improve the processing of the query. |
|---|
| 392 | <P> |
|---|
| 393 | In fact in the Xapian software <I>R</I> tends to mean not the complete set of |
|---|
| 394 | relevant documents, which indeed can rarely be discovered, but a small set of |
|---|
| 395 | documents which have been judged as relevant. |
|---|
| 396 | <P> |
|---|
| 397 | Suppose we have no documents marked as relevant. Then R = r = 0, and w(t) |
|---|
| 398 | becomes, |
|---|
| 399 | |
|---|
| 400 | <blockquote> |
|---|
| 401 | <table border=0><tr valign=center> |
|---|
| 402 | <td><tt>log </tt></td> |
|---|
| 403 | <td> |
|---|
| 404 | <font size="+2">(</font> |
|---|
| 405 | </td> |
|---|
| 406 | <td> |
|---|
| 407 | <tt><center> |
|---|
| 408 | <u>N - n + ½</u><br>n + ½</center></tt> |
|---|
| 409 | </td> |
|---|
| 410 | <td> |
|---|
| 411 | <font size="+2">)</font> |
|---|
| 412 | </td> |
|---|
| 413 | </tr></table> |
|---|
| 414 | </blockquote> |
|---|
| 415 | |
|---|
| 416 | This is approximately log((N - n)/n). Or log(N/n), since n is usually small |
|---|
| 417 | compared with N. This is called inverse logarithmic weighting, and has been |
|---|
| 418 | used in IR for many decades, quite independently of the probabilistic theory |
|---|
| 419 | which underpins it. Weights of this form are in fact the starting point in |
|---|
| 420 | Xapian when no relevance information is present. |
|---|
| 421 | <P> |
|---|
| 422 | The number n incidentally is often called the <EM>frequency</EM> of a term. |
|---|
| 423 | We prefer |
|---|
| 424 | the phrase <EM>term frequency</EM>, to better distinguish it from wdf and wqf |
|---|
| 425 | introduced below. |
|---|
| 426 | <P> |
|---|
| 427 | In extreme cases w(t) can be negative. In Xapian, negative values are |
|---|
| 428 | disallowed, and simply replaced by a small positive value. |
|---|
| 429 | |
|---|
| 430 | |
|---|
| 431 | <H2>wdp, wdf, ndl and wqf</H2> |
|---|
| 432 | |
|---|
| 433 | Before we see how the weights are used there are a few more ideas to introduce. |
|---|
| 434 | <P> |
|---|
| 435 | As mentioned before, a term t is said to index a document D, or t -> D. We have emphasised that D |
|---|
| 436 | may not be a piece of text in machine-readable form, and that, even when it |
|---|
| 437 | is, t may not actually occur in the text of D. Nevertheless, it will often be |
|---|
| 438 | the case that D is made up of a list of words, |
|---|
| 439 | <PRE> |
|---|
| 440 | D = w<sub>1</sub>, w<sub>2</sub>, w<sub>3</sub> ... w<sub>m</sub> |
|---|
| 441 | </PRE> |
|---|
| 442 | |
|---|
| 443 | and that many, if not all, of the terms which index D derive from these words |
|---|
| 444 | (for example, the terms are often lower-cased and stemmed forms of these words). |
|---|
| 445 | |
|---|
| 446 | <P> |
|---|
| 447 | If a term derives from words w<sub>9</sub>, w<sub>38</sub>, w<sub>97</sub> and |
|---|
| 448 | w<sub>221</sub> in the indexing process, we can say that the term `occurs' in D at |
|---|
| 449 | positions 9, 38, 97 and 221, and so for each term a document may have a |
|---|
| 450 | vector of positional information. These are the <EM>within-document positions</EM> of |
|---|
| 451 | t, or the <EM>wdp</EM> information of t. |
|---|
| 452 | <P> |
|---|
| 453 | The <EM>within-document frequency</EM>, or <EM>wdf</EM>, of a term t in D is the number of |
|---|
| 454 | times it is pulled out of D in the indexing process. Usually this is the size |
|---|
| 455 | of the wdp vector, but in Xapian it can exceed it, since we can |
|---|
| 456 | apply extra wdf to some parts of the document text. For example, |
|---|
| 457 | often this is done for the document title and abstract to |
|---|
| 458 | attach extra importance to their contents compared to the |
|---|
| 459 | rest of the document text. |
|---|
| 460 | <P> |
|---|
| 461 | There are various ways in which we might measure the length of a document, but |
|---|
| 462 | the easiest is to suppose it is made up of m words, w<sub>1</sub> to w<sub>m</sub>, and to define |
|---|
| 463 | its length as m. |
|---|
| 464 | <P> |
|---|
| 465 | The <EM>normalised document length</EM>, or <EM>ndl</EM>, is then m divided by the average |
|---|
| 466 | length of the documents in the IR system. So the average length document has |
|---|
| 467 | ndl equal to 1, short documents are less than 1, long documents greater than |
|---|
| 468 | 1. We have found that very small ndl values create problems, so Xapian |
|---|
| 469 | actually allows for a non-zero minimum value for the ndl. |
|---|
| 470 | |
|---|
| 471 | <P> |
|---|
| 472 | In the probabilistic model the query, <I>Q</I>, is itself very much like another |
|---|
| 473 | document. Frequently indeed <I>Q</I> will be created from a document, either one |
|---|
| 474 | already in the IR system, or by an indexing process very similar to the one |
|---|
| 475 | used to add documents into the whole IR system. This corresponds to a user |
|---|
| 476 | saying "give me other documents like this one". One can therefore attach a |
|---|
| 477 | similar meaning to within-query position information, within-query frequency, |
|---|
| 478 | and normalised query length, or wqp, wqf and nql. Xapian does not currently |
|---|
| 479 | use the concept of wqp. |
|---|
| 480 | |
|---|
| 481 | |
|---|
| 482 | <H2>Using the weights. The <I>MSet</I></H2> |
|---|
| 483 | |
|---|
| 484 | Now to pull everything together. From the probabilistic term weights we can |
|---|
| 485 | assign a weight to any document, d, as follows, |
|---|
| 486 | |
|---|
| 487 | <blockquote> |
|---|
| 488 | <table border=0><tr valign=center> |
|---|
| 489 | <td><tt>W(d) = </tt></td> |
|---|
| 490 | <td> |
|---|
| 491 | <tt><center> |
|---|
| 492 | <font size="+4">Σ</font><br><small>t -> d, t in <i>Q</i></small></tt> |
|---|
| 493 | </td> |
|---|
| 494 | <td><tt><center> |
|---|
| 495 | <u>(k + 1) f</u><sub>t</sub><br>k.L<sub>d</sub> + f<sub>t</sub> |
|---|
| 496 | </center></tt></td> |
|---|
| 497 | <td><tt> w(t)</tt></td> |
|---|
| 498 | </tr></table> |
|---|
| 499 | </blockquote> |
|---|
| 500 | |
|---|
| 501 | The sum extends over the terms of <I>Q</I> which index d. f<sub>t</sub> is the wdf of t in d, |
|---|
| 502 | L<sub>d</sub> is the ndl of d, and k is some suitably chosen constant. |
|---|
| 503 | <P> |
|---|
| 504 | The factor k+1 is actually redundant, but helps with the interpretation of |
|---|
| 505 | the equation. In Xapian, this weighting scheme is implemented by the |
|---|
| 506 | <a href="apidoc/html/classXapian_1_1TradWeight.html">Xapian::TradWeight class</a> |
|---|
| 507 | and the factor (k+1) is ignored. |
|---|
| 508 | |
|---|
| 509 | <P> |
|---|
| 510 | If k is set to zero the factor before w(t) is 1, and the wdfs |
|---|
| 511 | are ignored. As k tends to infinity, the factor becomes f<sub>t</sub>/L<sub>d</sub>, and the wdfs |
|---|
| 512 | take on their greatest importance. Intermediate values scale the wdf |
|---|
| 513 | contribution between these extremes. |
|---|
| 514 | The best k actually depends on the characteristics of the IR system as a |
|---|
| 515 | whole, and unfortunately no rule can be given for choosing it. |
|---|
| 516 | By default, Xapian sets k to 1 which should give reasonable results |
|---|
| 517 | for most systems. W(d) is merely tweaked a bit by the |
|---|
| 518 | wdf values, and users observe a simple pattern of retrieval. |
|---|
| 519 | It is possible to tune k to provide optimal results for a |
|---|
| 520 | specific system. |
|---|
| 521 | <P> |
|---|
| 522 | Any d in the IR system has a value W(d), but, if no term of the query indexes |
|---|
| 523 | d, W(d) will be zero. In practice only documents for which W(d) > 0 will be of |
|---|
| 524 | interest, and these are the documents indexed by at least one term of <I>Q</I>. If |
|---|
| 525 | we now take these documents and arrange them by decreasing W(d) value, we get |
|---|
| 526 | a ranked list called the <EM>match set</EM>, or <EM>MSet</EM>, of document and weight |
|---|
| 527 | pairs: |
|---|
| 528 | <PRE> |
|---|
| 529 | MSet: item 0: D<sub>0</sub> W(D<sub>0</sub>) |
|---|
| 530 | item 1: D<sub>1</sub> W(D<sub>1</sub>) |
|---|
| 531 | item 2: D<sub>2</sub> W(D<sub>2</sub>) |
|---|
| 532 | .... |
|---|
| 533 | item K: D<sub>K</sub> W(D<sub>K</sub>) |
|---|
| 534 | </PRE> |
|---|
| 535 | |
|---|
| 536 | where W(D<sub>j</sub>) >= W(D<sub>i</sub>) if j > i. |
|---|
| 537 | <P> |
|---|
| 538 | And according to the probabilistic model, the documents D<sub>0</sub>, D<sub>1</sub>, D<sub>2</sub> ... are |
|---|
| 539 | ranked by decreasing order of probability of relevance. So D<sub>0</sub> has highest |
|---|
| 540 | probability of being relevant, then D<sub>1</sub> and so on. |
|---|
| 541 | <P> |
|---|
| 542 | Xapian creates the MSet from the posting lists of the terms |
|---|
| 543 | of the query. This is the central operation of any IR system, and will be |
|---|
| 544 | familiar to anyone who has used one of the Internet's major search engines, |
|---|
| 545 | where the query is what you type in the query box, and the resulting hit list |
|---|
| 546 | corresponds to the top few items of the MSet. |
|---|
| 547 | <P> |
|---|
| 548 | |
|---|
| 549 | The cutoff point, K, is chosen when the MSet is created. The candidates for |
|---|
| 550 | inclusion in the MSet are all documents indexed by at least one term of <I>Q</I>, |
|---|
| 551 | and their number will usually exceed the choice of K (K is typically set to |
|---|
| 552 | be 1000 or less). So the MSet is actually the best K documents found in the |
|---|
| 553 | match process. |
|---|
| 554 | <P> |
|---|
| 555 | |
|---|
| 556 | A modification of this weighting scheme can be employed that takes into |
|---|
| 557 | account the query itself: |
|---|
| 558 | |
|---|
| 559 | <blockquote> |
|---|
| 560 | <table border=0><tr valign=center> |
|---|
| 561 | <td><tt>W(d) = </tt></td> |
|---|
| 562 | <td> |
|---|
| 563 | <tt><center> |
|---|
| 564 | <font size="+4">Σ</font><br><small>t -> d, t in <i>Q</i></small></center></tt> |
|---|
| 565 | </td> |
|---|
| 566 | <td><tt><center> |
|---|
| 567 | <u>(k<sub>3</sub> + 1) q</u><sub>t</sub><br>k<sub>3</sub>L' + q<sub>t</sub> |
|---|
| 568 | </center></tt></td> |
|---|
| 569 | <td><tt> </tt></td> |
|---|
| 570 | <td><tt><center> |
|---|
| 571 | <u>(k + 1) f</u><sub>t</sub><br>kL<sub>d</sub> + f<sub>t</sub> |
|---|
| 572 | </center></tt></td> |
|---|
| 573 | <td><tt> w(t)</tt></td> |
|---|
| 574 | </tr></table> |
|---|
| 575 | </blockquote> |
|---|
| 576 | |
|---|
| 577 | where q<sub>t</sub> is the wqf of t in <I>Q</I>, L' is the nql, or normalised query length, |
|---|
| 578 | and k<sub>3</sub> is a further constant. In computing W(d) across the document space, |
|---|
| 579 | this extra factor may be viewed as just a modification to the basic term |
|---|
| 580 | weights, w(t). Like k and k<sub>3</sub>, we will need to make an inspired guess for L'. |
|---|
| 581 | In fact the choices for k<sub>3</sub> and L' will depend on the broader context of the |
|---|
| 582 | use of this formula, and more advice will be given as occasion arises. |
|---|
| 583 | |
|---|
| 584 | <p> |
|---|
| 585 | Xapian's default weighting scheme is a generalised form of this weighting |
|---|
| 586 | scheme modification, known as <a href="bm25.html">BM25</a>. In BM25, L' is |
|---|
| 587 | always set to 1. |
|---|
| 588 | |
|---|
| 589 | <H2>Using the weights: the <I>ESet</I></H2> |
|---|
| 590 | |
|---|
| 591 | But as well as ranking documents, Xapian can rank terms, and this is most |
|---|
| 592 | important. |
|---|
| 593 | The higher up the ranking the term is, the more likely it is |
|---|
| 594 | to act as a good differentiator between relevant and non-relevant documents. |
|---|
| 595 | It is therefore a candidate for adding back into the query. Terms from this |
|---|
| 596 | list can therefore be used to expand the size of the query, after which the |
|---|
| 597 | query can be re-run to get a better MSet. Because this list of terms is |
|---|
| 598 | mainly used for query expansion, it is called the <EM>expand set</EM> or <EM>ESet</EM>. |
|---|
| 599 | |
|---|
| 600 | <P> |
|---|
| 601 | The term expansion weighting formula is as follows, |
|---|
| 602 | |
|---|
| 603 | <blockquote> |
|---|
| 604 | <PRE> |
|---|
| 605 | W(t) = r w(t) |
|---|
| 606 | </PRE> |
|---|
| 607 | </blockquote> |
|---|
| 608 | |
|---|
| 609 | in other words we multiply the term weight by the number of |
|---|
| 610 | relevant documents that have been indexed by the term. |
|---|
| 611 | <P> |
|---|
| 612 | The ESet then has this form, |
|---|
| 613 | |
|---|
| 614 | <PRE> |
|---|
| 615 | ESet: item 0: t<sub>0</sub> W(t<sub>0</sub>) |
|---|
| 616 | item 1: t<sub>1</sub> W(t<sub>1</sub>) |
|---|
| 617 | item 2: t<sub>2</sub> W(t<sub>2</sub>) |
|---|
| 618 | .... |
|---|
| 619 | item K: t<sub>K</sub> W(t<sub>K</sub>) |
|---|
| 620 | </PRE> |
|---|
| 621 | |
|---|
| 622 | where W(t<sub>j</sub>) >= W(t<sub>i</sub>) if j > i. |
|---|
| 623 | <P> |
|---|
| 624 | Since the main function of the ESet is to find new terms to be added to <I>Q</I>, |
|---|
| 625 | we usually omit from it terms already in <I>Q</I>. |
|---|
| 626 | <P> |
|---|
| 627 | The W(t) weight is applicable to any term in the IR system, but has a value |
|---|
| 628 | zero when t does not index a relevant document. The ESet is therefore |
|---|
| 629 | confined to be a ranking of the best K terms which index relevant documents. |
|---|
| 630 | <P> |
|---|
| 631 | This simple form of W(t) is traditional in the probabilistic model, but |
|---|
| 632 | seems less than optimal because it does not take into account wdf information. |
|---|
| 633 | One can if fact try to generalise it to |
|---|
| 634 | |
|---|
| 635 | <blockquote> |
|---|
| 636 | <table border=0><tr valign=center> |
|---|
| 637 | <td><tt>W(t) = </tt></td> |
|---|
| 638 | <td> |
|---|
| 639 | <tt><center> |
|---|
| 640 | <font size="+4">Σ</font><br><small>t -> d, d in <i>R</i></small></tt> |
|---|
| 641 | </td> |
|---|
| 642 | <td><tt><center> |
|---|
| 643 | <u>(k + 1) f</u><sub>t</sub><br>kL + f<sub>t</sub> |
|---|
| 644 | </center></tt></td> |
|---|
| 645 | <td><tt> w(t)</tt></td> |
|---|
| 646 | </tr></table> |
|---|
| 647 | </blockquote> |
|---|
| 648 | |
|---|
| 649 | k is again a constant, but it does not need to have the same |
|---|
| 650 | value as the k used in the probabilistic term weights above. |
|---|
| 651 | <!-- FIXME: Might it have the same value? If not, better to rename it? |
|---|
| 652 | Is it be sensible to set it the same as the other k if you change that? --> |
|---|
| 653 | In Xapian, k defaults to 1.0 for ESet generation. |
|---|
| 654 | <P> |
|---|
| 655 | This reduces to W(t) = r w(t) when k = 0. Certainly this form can be |
|---|
| 656 | recommended in the very common case where r = 1, that is, we have a single |
|---|
| 657 | document marked relevant. |
|---|
| 658 | |
|---|
| 659 | <H2>The progress of a query</H2> |
|---|
| 660 | |
|---|
| 661 | <p> |
|---|
| 662 | Below we describe the general case of the IR model supported, including |
|---|
| 663 | use of a relevance set (<a href="glossary.html#rset">RSet</a>), |
|---|
| 664 | query expansion, improved term weights and reranking. |
|---|
| 665 | You don't have to use any of these for Xapian to be useful, but they are |
|---|
| 666 | available should you need them. |
|---|
| 667 | </p> |
|---|
| 668 | |
|---|
| 669 | <p> |
|---|
| 670 | The user enters a query. This is parsed into a form the IR system understands, |
|---|
| 671 | and run by the IR system, which returns two lists, a |
|---|
| 672 | list of captions, derived from the MSet, and a list of terms, from the ESet. |
|---|
| 673 | If the RSet is empty, the first few documents of the MSet can be |
|---|
| 674 | used as a stand-in - after all, they have a good chance of being relevant! You |
|---|
| 675 | can read a document by clicking on the caption. (We assume the usual |
|---|
| 676 | screen/mouse environment.) But you can also mark a document as relevant |
|---|
| 677 | (change <I>R</I>) or cause a term to be added from the ESet to the query (change |
|---|
| 678 | <I>Q</I>). As soon as any change is made to the query environment the query can be |
|---|
| 679 | rerun, although you might have a front-end where nothing happens until you |
|---|
| 680 | click on some "Run Query" button. |
|---|
| 681 | </p> |
|---|
| 682 | |
|---|
| 683 | <p> |
|---|
| 684 | In any case rerunning the query leads to a new MSet and ESet, and so to a |
|---|
| 685 | new display. The IR process is then an iterative one. You can delete terms |
|---|
| 686 | from the query or add them in; mark or unmark documents as being relevant. |
|---|
| 687 | Eventually you converge on the answer to the query, or at least, the best |
|---|
| 688 | answer the IR system can give you. |
|---|
| 689 | </p> |
|---|
| 690 | |
|---|
| 691 | <H2>Further Reading</H2> |
|---|
| 692 | |
|---|
| 693 | <P>If you want to find out more, then |
|---|
| 694 | <a href="http://citeseer.ist.psu.edu/robertson97simple.html"><i>"Simple, |
|---|
| 695 | proven approaches to text retrieval"</i></a> is a worthwhile read. It's a |
|---|
| 696 | good introduction to Probabilistic Information retrieval, |
|---|
| 697 | which is basically what Xapian provides. |
|---|
| 698 | |
|---|
| 699 | <P> |
|---|
| 700 | There are also several good books on the subject of Information retrieval. |
|---|
| 701 | |
|---|
| 702 | <UL> |
|---|
| 703 | <LI>"<em>Information Retrieval</em>" by C. J. van |
|---|
| 704 | Rijsbergen is well worth reading. It's out of print, but is available |
|---|
| 705 | for free <A |
|---|
| 706 | HREF="http://www.dcs.gla.ac.uk/Keith/Preface.html">from the author's website</A> |
|---|
| 707 | (in HTML or PDF). |
|---|
| 708 | |
|---|
| 709 | <LI> "<em>Readings in Information Retrieval</em>" (published by |
|---|
| 710 | Morgan Kaufmann, edited by Karen Sparck Jones and Peter Willett) is a |
|---|
| 711 | collection of published papers covering many aspects of the subject. |
|---|
| 712 | |
|---|
| 713 | <LI> "<em>Managing Gigabytes</em>" (also published by |
|---|
| 714 | Morgan Kaufmann, written by Ian H. Witten, Alistair Moffat and |
|---|
| 715 | Timothy C. Bell) describes information retrieval and compression techniques. |
|---|
| 716 | |
|---|
| 717 | <LI> "<em>Modern Information Retrieval</em>" (published by |
|---|
| 718 | Addison Wesley, written by Ricardo Baeza-Yates and Berthier Ribeiro-Neto) |
|---|
| 719 | gives a good overview of the field. It was published more recently than |
|---|
| 720 | the books above, and so covers some more recent developments. |
|---|
| 721 | |
|---|
| 722 | <li> "<em>Introduction to Information Retrieval</em>" (<b>to |
|---|
| 723 | be published in 2008</b> by Cambridge University Press, written by |
|---|
| 724 | Christopher D. Manning, Prabhakar Raghavan and Hinrich Schiütze) |
|---|
| 725 | looks to be a good introductory work (we've not read it in detail yet). |
|---|
| 726 | It's not yet published, but the latest draft can be read online at |
|---|
| 727 | <a href="http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html">the |
|---|
| 728 | book's companion website</a> which says the online version will remain |
|---|
| 729 | after publication. |
|---|
| 730 | |
|---|
| 731 | </UL> |
|---|
| 732 | |
|---|
| 733 | <!-- FOOTER $Author$ $Date$ $Id$ --> |
|---|
| 734 | </BODY> |
|---|
| 735 | </HTML> |
|---|