Revision 10007, 38.6 kB (checked in by olly, 12 months ago)

docs/overview.html: Remove commented-out comment about OP_XOR.

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2<HTML>
3<HEAD>
4<TITLE>Xapian: Overview</TITLE>
5</HEAD>
6<BODY BGCOLOR="white" TEXT="black">
7
8<H1>Overview</H1>
9
10<P>
11This document provides an introduction to the native C++ Xapian API.
12This API provides programmers with the ability to index and search through
13(potentially very large) bodies of data using probabilistic methods.
14</P>
15
16<P>
17<EM>Note:</EM>
18The portion of the API currently documented here covers only the part
19of Xapian concerned with searching through existing databases, not that
20concerned with creating them.
21</P>
22
23<P>
24This document assumes you already have Xapian installed, so if you
25haven't, it is a good idea to read <A HREF="install.html">Installing Xapian</A> first.
26</P>
27
28<P>
29You may also wish to read
30the <A HREF="quickstart.html">QuickStart</A> reference, for some simple
31worked examples of Xapian usage, and the
<A HREF="intro_ir.html">Introduction to Information Retrieval</A> for
background on the Information Retrieval theories behind Xapian.
34</P>
35
36<P>
This document does not detail the exact calling conventions (parameters
passed, return values, exceptions thrown, etc.) for each method in the API.
39For such documentation, you should refer to the automatically extracted
40documentation, which is generated from detailed comments in the source code,
41and should thus remain up-to-date and accurate.  This documentation is
42generated using the
43<EM><A HREF="http://www.doxygen.org/">Doxygen</A></EM>
44application.  To save you having to generate this documentation yourself,
45we include the <A HREF="apidoc/html/index.html">built version</A>
46in our distributions, and also keep the
47<A HREF="http://xapian.org/docs/apidoc/html/index.html">latest version</A> on our website.
48</P>
49
50<H2>Design Principles</H2>
51
52<P>
53API classes are either very lightweight or a wrapper around a reference counted
54pointer (this style of class design is sometimes known as PIMPL for "Private
55IMPLementation").  In either case copying is a cheap operation as classes
56are at most a few words of memory.
57</P>
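<P>
The handle style described above can be sketched as follows.  This is only an
illustration: <CODE>Handle</CODE>, <CODE>Internal</CODE> and their members are
hypothetical names, not Xapian classes, and <CODE>std::shared_ptr</CODE> stands
in for whatever reference counted pointer the library uses internally.
</P>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an API class; not part of Xapian.
class Handle {
    // Private implementation object, shared by every copy of the handle.
    struct Internal {
        std::string name;
        explicit Internal(const std::string& n) : name(n) {}
    };

    // The only data member: copying a Handle copies just this pointer.
    std::shared_ptr<Internal> internal;

  public:
    explicit Handle(const std::string& name)
        : internal(std::make_shared<Internal>(name)) {}

    const std::string& get_name() const {
        assert(internal);  // handles in this sketch are never empty
        return internal->name;
    }

    // Number of handles currently sharing the implementation object.
    long use_count() const { return internal.use_count(); }

    // True if both handles refer to the same implementation object.
    bool shares_with(const Handle& other) const {
        return internal == other.internal;
    }
};
```

<P>
Copying such a handle (<CODE>Handle b = a;</CODE>) leaves both copies referring
to the same implementation object, which is why passing API objects by value
is cheap.
</P>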
58
59<P>
60API objects keep a reference to other objects they rely on so the user
61doesn't need to worry about whether an object is still valid or not.
62</P>
63
64<P>
65Where appropriate, API classes can be used as containers and iterators just
66like those in the C++ STL.
67</P>
68
69<H2>Errors and exceptions</H2>
70
71<P>
72Error reporting is often relegated to the back of manuals such as this.
73However, it is extremely important to understand the errors which may be
74caused by the operations which you are trying to perform.
75</P>
76
77<P>
78This becomes particularly relevant when using a large system, with such
79possibilities as databases which are being updated while you search
80through them, and distributed enquiry systems.
81</P>
82
83<P>
84Errors in Xapian are all reported by means of exceptions.  All exceptions
85thrown by Xapian will be subclasses of
86<A HREF="apidoc/html/classXapian_1_1Error.html"><CODE>Xapian::Error</CODE></A>.  Note that
87<CODE>Xapian::Error</CODE> is an abstract class; thus you must catch exceptions
88by reference rather than by value.
89</P>
90
91<P>
92There are two flavours of error, derived from <CODE>Xapian::Error</CODE>:
93<UL><LI>
94<A HREF="apidoc/html/classXapian_1_1LogicError.html"><CODE>Xapian::LogicError</CODE></A>
95- for error conditions due to programming errors, such as a misuse of the
96API.  A finished application should not receive these errors (though it
97would still be sensible to catch them).
98</LI><LI>
99<A HREF="apidoc/html/classXapian_1_1RuntimeError.html"><CODE>Xapian::RuntimeError</CODE></A>
100- for error conditions due to run time problems, such as failure to open
101a database.  You must always be ready to cope with such errors.
102</LI></UL>
103</P>
104
105<P>
106Each of these flavours is further subdivided, such that any particular
107error condition can be trapped by catching the appropriate exception.
108If desired, a human readable explanation of the error can be retrieved
109by calling
110<A HREF="apidoc/html/classXapian_1_1Error.html"><CODE>Xapian::Error::get_msg()</CODE></A>.
111</P>
112
113<P>
In addition, standard system errors may occur: these will be reported by
throwing appropriate exceptions.  Most notably, if the system runs out
of memory, a <CODE>std::bad_alloc</CODE> exception will be thrown.
117</P>
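<P>
The following sketch shows the catch-by-reference pattern on a small stand-in
hierarchy.  The classes in namespace <CODE>sketch</CODE> only mimic the shape
described above (an abstract base with two flavours and a
<CODE>get_msg()</CODE> accessor); <CODE>open_database()</CODE> and
<CODE>try_open()</CODE> are hypothetical helpers.  With the real library you
would catch <CODE>Xapian::Error</CODE> and its subclasses in the same way.
</P>

```cpp
#include <cassert>
#include <new>
#include <string>

namespace sketch {

// Abstract base class: it cannot be instantiated, so it cannot be caught
// by value, only by reference.
class Error {
    std::string msg;

  protected:
    explicit Error(const std::string& m) : msg(m) {}

  public:
    virtual ~Error() {}
    virtual const char* get_type() const = 0;
    const std::string& get_msg() const { return msg; }
};

// Programming errors, such as misuse of the API.
class LogicError : public Error {
  public:
    explicit LogicError(const std::string& m) : Error(m) {}
    const char* get_type() const { return "LogicError"; }
};

// Run time problems, such as failure to open a database.
class RuntimeError : public Error {
  public:
    explicit RuntimeError(const std::string& m) : Error(m) {}
    const char* get_type() const { return "RuntimeError"; }
};

// Hypothetical operation which always fails, in one of the two flavours.
void open_database(bool misuse) {
    if (misuse) throw LogicError("empty path passed");
    throw RuntimeError("database could not be opened");
}

// Most specific handlers first, then the base class, then system errors.
std::string try_open(bool misuse) {
    try {
        open_database(misuse);
    } catch (const LogicError& e) {
        return "bug: " + e.get_msg();
    } catch (const Error& e) {
        return std::string(e.get_type()) + ": " + e.get_msg();
    } catch (const std::bad_alloc&) {
        return "out of memory";
    }
    return "ok";
}

}  // namespace sketch
```

<P>
Writing <CODE>catch (Error e)</CODE> instead would not compile here, because
an object of an abstract class cannot be created.
</P>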
118
119<H2>Terminology</H2>
120<H3>Databases</H3>
121<P>
122These may also occasionally be called <EM>Indexes</EM>.  In Xapian (as
123opposed to a database package) a database consists of little more than
124indexed documents: this reflects the purpose of Xapian as an information
125retrieval system, rather than an information storage system.
126</P>
127<P>
128The exact contents of a database depend on the type (see
129&quot;<A HREF="#database_types">Database Types</A>&quot; for more details
130of the database types currently provided).
131</P>
132
133<H3>Queries</H3>
134<P>
135The information to be searched for is specified by a <EM>Query</EM>.  In
136Xapian, queries are made up of a structured boolean tree, upon which
137probabilistic weightings are imposed: when the search is performed, the
138documents returned are filtered according to the boolean structure, and
139weighted (and sorted) according to the probabilistic model of information
140retrieval.
141</P>
142
143<H2>Memory handling</H2>
144<P>
145The user of Xapian does not usually need to worry about how Xapian performs
146its memory allocation: Xapian objects can all be created and deleted as any
147other C++ objects.  The convention is that whoever creates an object
148is ultimately responsible for deleting it.  This becomes relevant when
149passing a pointer to data to Xapian: Xapian will not assume that such
150pointers remain valid across separate API calls, and it will be the
caller's responsibility to delete the object pointed to, as and when
152required.
153</P>
154
155<H2>The Xapian::Enquire class</H2>
156
157<P>
158The <A HREF="apidoc/html/classXapian_1_1Enquire.html"><CODE>Xapian::Enquire</CODE></A> class
159is central to all searching operations.  It provides an interface for
160<UL><LI>
161Specifying the database, or databases, to search across.
162</LI><LI>
163Specifying a query to perform.
164</LI><LI>
165Specifying a set of documents which a user considers relevant.
166</LI><LI>
167Given the supplied information, returning a ranked set of documents for
168the user.
169</LI><LI>
170Given the supplied information, suggesting a ranked set of terms to add to the
171query.
172</LI><LI>
173Returning information about the documents which matched, such as their
174associated data, and which terms from the query were found within them.
175</LI></UL>
176</P>
177<P>
178A typical enquiry session will consist of most of these operations, in
179various orders.  The Xapian::Enquire class presents as few restrictions as
180possible on the order in which operations should be performed.  Although
181you must set the query before any operation which uses it, you can call
182any of the other methods in any order.
183</P>
184<P>
Many operations performed by the Xapian::Enquire class are performed lazily (i.e.
186just before their results are needed).  This need not concern the user
187except to note that, as a result, errors may not be reported as soon as
188would otherwise be expected.
189</P>
190
191<H2>Specifying a database</H2>
192
193<P>
194When creating a Xapian::Enquire object, a database to search must be specified.
195Databases are specified by creating a <A
196HREF="apidoc/html/classXapian_1_1Database.html"><CODE>Xapian::Database</CODE> object</A>.
197Generally, you can just construct the object, passing the pathname to the
198database.  Xapian looks at the path and autodetects the database type.
199</P>
200<P>
201In some cases (with the Remote backend, or if you want more control) you
202need to use a factory function such as <CODE>Xapian::Flint::open()</CODE>
203- each backend type has one or more.  The parameters the function
204takes depend on the backend type, and whether we are creating a read-only
205or a writable database.
206</P>
<P>
You can also create a "stub database" file which lists one or more databases.
These files are recognised by the autodetection in the Database constructor
(if the pathname is a file rather than a directory, it's treated as a stub
database file) or you can open them explicitly using Xapian::Auto::open_stub().
The stub database format specifies one database per line.  For example:
</P>

<BLOCKQUOTE><CODE>
remote localhost:23876<br>
flint /var/spool/xapian/webindex<br>
</CODE></BLOCKQUOTE>

<A NAME="database_types"><H3>Database types</H3></A>
<P>
The current types understood by Xapian are:
</P>
223<TABLE>
224<TR><TD VALIGN="top"><B>auto</B></TD><TD>
225<P>
226This isn't an actual database format, but rather auto-detection of one of the
227disk based backends ("flint", "quartz", or "stub") from a single specified
228file or directory path.
229</P>
230</TD></TR>
231<TR><TD VALIGN="top"><B>flint</B></TD><TD>
232<P>
Flint is the default backend as of Xapian 1.0. It supports incremental
modifications and concurrent single-writer, multiple-reader access to
a database.  It's very efficient and highly scalable.  Flint takes lessons
236learned from studying Quartz in action, and is appreciably faster
237(both when indexing and searching), more compact, and features an
238improved locking mechanism which automatically releases the lock
239if a writing process dies.
240</P>
241<!--
242<P>
243Flint is very much a work in progress. The aim is to have it stable and working at any given point (and Xapian's extensive test suite should help give us some reassurance of this), but the database format will change frequently and there'll be no migration path during development (except for rebuilding your index from the source data, or alternatively using copydatabase to copy the old-flint database to a quartz database, upgrading, then using copydatabase to copy the quartz database to a new-flint one).
244</P>
245<P>
246That said flint already outperforms quartz, and Gmane, tweakers.net and
247srpko.com are all running production systems using the flint backend.
248</P>
249-->
250<P>
251For more information, see the <a href="http://wiki.xapian.org/FlintBackend">Xapian Wiki</a>.
252</P>
253</TD></TR>
254<TR><TD VALIGN="top"><B>quartz</B></TD><TD>
255<P>
256Quartz was the default backend prior to Xapian 1.0.  New installations should
257use Flint, and existing installations should consider migrating to Flint.
258Support for Quartz will be dropped at some point in the future.
259</P>
260</TD></TR>
261<TR><TD VALIGN="top"><B>inmemory</B></TD><TD>
262This type is a database held entirely in memory.
263It was originally written for testing purposes only, but may
264prove useful for building up temporary small databases.
265</TD></TR>
266</TABLE>
267
268<H3>Multiple databases</H3>
269
270<P>
271Xapian can search across several databases as easily as searching across a
272single one.  Simply call
273<A HREF="apidoc/html/classXapian_1_1Database.html"><CODE>Xapian::Database::add_database()</CODE></A>
274for each database that you wish to search through.
275</P>
276<P>
You can also set up "pre-canned" lists of databases to search over using
278a "stub database" - see above for details.
279</P>
280<!-- I don't really think this says anything useful...
281<P>
282Other operations, such as setting the query, may be performed before or after
283this call.  It is even possible to perform a query, add a further database,
284and then perform the query again to get the results with the extra database
285(although this isn't very likely to be useful in practice).
286</P>-->
287
288<H2>Specifying a query</H2>
289
290<P>
291Xapian implements both boolean and probabilistic searching.
292There are two obvious ways in which a pure boolean query can be combined
293with a pure probabilistic query:
294<UL><LI>
295First perform the boolean search to create a subset of the whole document
296collection, and then do the probabilistic search on this subset, or
297</LI><LI>
298Do the probabilistic search, and then filter out the resulting documents
299with a boolean query.
300</LI></UL>
There is in fact a subtle difference between these two approaches. In the first,
302the collection statistics for the probabilistic query will be
303determined by the document subset which is obtained by running the boolean
304query. In the second, the collection statistics for the probabilistic
305query are determined by the whole document collection. These differences
306can affect the final result.
307
308</P>
309<P>
Suppose, for example, that the boolean query is
being used to retrieve documents in English from a database
containing English and French documents.
A word like
&quot;<EM>grand</EM>&quot;
exists in both languages (with similar meanings), but is more common in French
than English. In the English subset it could therefore be expected to have a higher
weight than it would get in the joint English and French databases.
318</P>
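<P>
The effect can be illustrated with a rough calculation.  The sketch below uses
a simplified inverse document frequency style weight, which is only an
assumption made for illustration (the weighting Xapian actually applies has
further components), and the collection sizes are invented.
</P>

```cpp
#include <cassert>
#include <cmath>

// Simplified inverse-document-frequency style weight.  This formula is an
// assumption for illustration only, not the weighting Xapian computes.
//   n_docs:    number of documents in the collection
//   term_docs: number of those documents containing the term
double term_weight(double n_docs, double term_docs) {
    assert(term_docs > 0 && term_docs <= n_docs);
    return std::log((n_docs - term_docs + 0.5) / (term_docs + 0.5));
}

// Invented figures: "grand" is rare in English, common in French.
const double english_docs = 1000, english_with_grand = 10;
const double french_docs = 1000, french_with_grand = 200;

// First approach: statistics come from the English subset only.
double weight_in_english_subset() {
    return term_weight(english_docs, english_with_grand);
}

// Second approach: statistics come from the whole collection.
double weight_in_whole_collection() {
    return term_weight(english_docs + french_docs,
                       english_with_grand + french_with_grand);
}
```

<P>
With these figures the term gets a weight of roughly 4.5 in the English subset
but only about 2.1 in the combined collection, which is the difference
described above.
</P>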
319
320<P>
321Xapian takes the second approach simply because this can be implemented very
322efficiently.  The first approach is more exact, but inefficient to implement.
323</P>
324
325<P>
326Rather than implementing this approach as described above and first
327performing the probabilistic search and then filtering the results, Xapian
328actually performs both tasks
329simultaneously.  This allows various optimisations to be performed, such
330as giving up on calculating a boolean AND operation when the probabilistic
331weights that could result from further documents can have no effect on the
332result set.  These optimisations have been found to often give a several-fold
333performance increase.  The performance is particularly good for queries
334containing many terms.
335</P>
336
337<H3>A query for a single term</H3>
338<P>
339A search query is represented by a
340<A HREF="apidoc/html/classXapian_1_1Query.html"><CODE>Xapian::Query</CODE></A>
341object.  The simplest useful query is one which searches for a single term
342(and several of these can be combined to form more complex queries).
343A single term query can be created as follows (where <CODE>term</CODE> is a
344<CODE>std::string</CODE> holding the term to be searched for):
345</P>
346<PRE>
347Xapian::Query query(term);
348</PRE>
349<P>
350A term in Xapian is represented simply by a string of binary characters.
351Usually, when searching text, these characters will be the word which the
352term represents, but during the information retrieval process Xapian
353attaches no specific meaning to the term.
354</P>
355<P>
356This constructor actually takes a couple of extra parameters, which may be
357used to specify positional and frequency information for terms in the query:
</P>
359<PRE>
360Xapian::Query(const string &amp; tname_,
361        Xapian::termcount wqf_ = 1,
362        Xapian::termpos term_pos_ = 0)
363</PRE>
364<P>
365The <CODE>wqf</CODE> (<B>W</B>ithin <B>Q</B>uery <B>F</B>requency) is
366a measure of how common a term is in the query.  This isn't useful for
367a single term query unless it is going to be combined to form a more
368complex query.  In that case, it's particularly useful
369when generating a query from an existing document, but may also be used
370to increase the "importance" of a term in a query.  Another way to
371increase the "importance" of a term is to use <code>OP_SCALE_WEIGHT</code>.
372But if the intention is simply to ensure that a particular term is in the query
373results, you should use a boolean AND or AND_MAYBE rather than setting a high wqf.
374</P>
375<P>
376The <CODE>term_pos</CODE> represents the position of the term in the query.
377Again, this isn't useful for a single term query by itself, but is used for
378phrase searching, passage retrieval, and other operations
379which require knowledge of the order of terms in the query (such as returning
380the set of matching terms in a given document in the same order as they
381occur in the query).  If such operations are not required, the default
382value of 0 may be used.
383</P>
384<P>
385Note that it may not make much sense to specify a wqf other than 1 when
386supplying a term position (unless you are trying to affect the weighting,
387as previously described).
388</P>
389<P>
390Note also that the results of <CODE>Xapian::Query(tname, 2)</CODE> and
391<CODE>Xapian::Query(Xapian::Query::OP_OR, Xapian::Query(tname), Xapian::Query(tname))</CODE>
392are exactly equivalent.
393</P>
394
395<H3>Compound queries</H3>
396<P>
Compound queries can be built up from single term queries by combining
them with a connecting operator. Most operators can operate on either
399a single term query or a compound query. You can combine pair-wise
400using the following constructor:
401</P>
402<PRE>
403Xapian::Query(Xapian::Query::op op_,
404        const Xapian::Query &amp; left,
405        const Xapian::Query &amp; right)
406</PRE>
407<P>
408The two most commonly used operators are <CODE>Xapian::Query::OP_AND</CODE> and
409<CODE>Xapian::Query::OP_OR</CODE>, which enable us to construct boolean queries made
410up from the usual AND and OR operations. But in addition to this, a
411probabilistic query in its simplest form, where we have a list of terms
412which give rise to weights that need to be added together, is also made up
413from a set of terms joined together with <CODE>Xapian::Query::OP_OR</CODE>.
414</P>
415<P>
416The full set of available <CODE>Xapian::Query::op</CODE> operators is:
417<TABLE>
418<TR><TD VALIGN="top">
419Xapian::Query::OP_AND
420</TD><TD>
421Return documents returned by both subqueries.
422</TD></TR><TR><TD VALIGN="top">
423Xapian::Query::OP_OR
424</TD><TD>
425Return documents returned by either subquery.
426</TD></TR><TR><TD VALIGN="top">
427Xapian::Query::OP_AND_NOT
428</TD><TD>
429Return documents returned by the left subquery but not the right subquery.
430</TD></TR><TR><TD VALIGN="top">
431Xapian::Query::OP_FILTER
432</TD><TD>
433As Xapian::Query::OP_AND, but use only weights from left subquery.
434</TD></TR><TR><TD VALIGN="top">
435Xapian::Query::OP_AND_MAYBE
436</TD><TD>
437Return documents returned by the left subquery, but adding
438document weights from both subqueries.
439</TD></TR><TR><TD VALIGN="top">
440Xapian::Query::OP_XOR
441</TD><TD>
442Return documents returned by one subquery only.
443</TD></TR><TR><TD VALIGN="top">
444Xapian::Query::OP_NEAR
445</TD><TD>
Return documents where the terms are within the specified distance of each other.
447</TD></TR><TR><TD VALIGN="top">
448Xapian::Query::OP_PHRASE
449</TD><TD>
Return documents where the terms are within the specified distance of each other
451and in the given order.
452</TD></TR><TR><TD VALIGN="top">
453Xapian::Query::OP_ELITE_SET
454</TD><TD>
455Select an elite set of terms from the subqueries, and perform
456a query with all those terms combined as an OR query.
457</TD></TR>
458</TABLE>
459</P>
460
461
462<H3>Understanding queries</H3>
463
464<P>
465Each term in the query has a weight in each document.  Each document may also
466have an additional weight not associated with any of the terms.  By default
467the probabilistic weighting scheme <a href="bm25.html">BM25</a>
468is used to provide the formulae which
469give these weights.
470</P>
471<P>
472A query can be thought of as a tree structure. At each node is
473an <CODE>Xapian::Query::op</CODE> operator, and on the left and right branch are two other queries.
474At each leaf node is a term, t, transmitting documents and scores, D and
475w<sub>D</sub>(t),
476up the tree.
477</P>
478<P>
479A Xapian::Query::OP_OR node transmits documents from both branches up the tree, summing the scores
480when a document is found in both the left and right branch. For example,
481
482<PRE>
483                           docs       1    8    12    16    17    18
484                           scores    7.3  4.1   3.2  7.6   3.8   4.7 ...
485                             |
486                             |
487                   Xapian::Query::OP_OR
488                         /       \
489                        /         \
490                       /           \
491                      /             \
492   docs     1   12   16   17         1   8   16   18
493   scores  3.1 3.2  3.1  3.8 ...    4.2 4.1 4.5  4.7 ...
494</PRE>
495
496A Xapian::Query::OP_AND node transmits only the documents found on both
497branches up the tree, again summing the scores,
498
499<PRE>
500                           docs       1   16
501                           scores    7.3  7.6  ...
502                             |
503                             |
504                   Xapian::Query::OP_AND
505                         /       \
506                        /         \
507                       /           \
508                      /             \
509   docs     1   12   16   17         1   8   16   18
510   scores  3.1 3.2  3.1  3.8 ...    4.2 4.1 4.5  4.7 ...
511</PRE>
512
A Xapian::Query::OP_AND_NOT node transmits up the tree the documents on the
left branch which are not on the right branch. The scores are taken from the
left branch only. For example,
516
517<PRE>
518                           docs       12   17
519                           scores    3.2  3.8 ...
520                             |
521                             |
522                 Xapian::Query::OP_AND_NOT
523                         /       \
524                        /         \
525                       /           \
526                      /             \
527   docs     1   12   16   17         1   8   16   18
528   scores  3.1 3.2  3.1  3.8 ...    4.2 4.1 4.5  4.7 ...
529</PRE>
530
531A Xapian::Query::OP_AND_MAYBE node transmits the documents up the tree from the
532left branch only, but adds in the score from the right branch for documents
533which occur on both branches.  For example,
534
535<PRE>
536                           docs       1    12   16   17
537                           scores    7.3  3.2  7.6  3.8 ...
538                             |
539                             |
540                Xapian::Query::OP_AND_MAYBE
541                         /       \
542                        /         \
543                       /           \
544                      /             \
545   docs     1   12   16   17         1   8   16   18
546   scores  3.1 3.2  3.1  3.8 ...    4.2 4.1 4.5  4.7 ...
547</PRE>
548
549Xapian::Query::OP_FILTER is like Xapian::Query::OP_AND, but weights are only
550transmitted from the left branch.  For example,
551
552<PRE>
553                           docs       1   16
554                           scores    3.1  3.1  ...
555                             |
556                             |
557                  Xapian::Query::OP_FILTER
558                         /       \
559                        /         \
560                       /           \
561                      /             \
562   docs     1   12   16   17         1   8   16   18
563   scores  3.1 3.2  3.1  3.8 ...    4.2 4.1 4.5  4.7 ...
564</PRE>
565Xapian::Query::OP_XOR is like Xapian::Query::OP_OR, but documents on both left
566and right branches are not transmitted up the tree. For example,
567
568<PRE>
569                           docs       8    12    17    18
570                           scores    4.1   3.2  3.8   4.7 ...
571                             |
572                             |
573                      Xapian::Query::OP_XOR
574                         /       \
575                        /         \
576                       /           \
577                      /             \
578   docs     1   12   16   17         1   8   16   18
579   scores  3.1 3.2  3.1  3.8 ...    4.2 4.1 4.5  4.7 ...
580</PRE>
581</P>
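<P>
The behaviour of each node type can also be written down as a small
simulation.  The sketch below is not how Xapian's matcher is implemented; it
simply models a posting list as a map from document id to score and reproduces
the diagrams above.  All names in it (<CODE>Postings</CODE>,
<CODE>op_or()</CODE> and so on) are hypothetical.
</P>

```cpp
#include <cassert>
#include <map>

// A posting list in this sketch: document id -> score.
typedef std::map<unsigned, double> Postings;

// OP_OR: documents from either branch, summing scores when in both.
Postings op_or(const Postings& left, const Postings& right) {
    Postings out = left;
    for (const auto& entry : right) out[entry.first] += entry.second;
    return out;
}

// OP_AND: documents in both branches, summing the scores.
Postings op_and(const Postings& left, const Postings& right) {
    Postings out;
    for (const auto& entry : left) {
        auto match = right.find(entry.first);
        if (match != right.end())
            out[entry.first] = entry.second + match->second;
    }
    return out;
}

// OP_AND_NOT: documents in left but not in right, with left's scores.
Postings op_and_not(const Postings& left, const Postings& right) {
    Postings out;
    for (const auto& entry : left)
        if (right.count(entry.first) == 0) out.insert(entry);
    return out;
}

// OP_AND_MAYBE: all documents in left, adding right's score where present.
Postings op_and_maybe(const Postings& left, const Postings& right) {
    Postings out = left;
    for (const auto& entry : right) {
        auto match = out.find(entry.first);
        if (match != out.end()) match->second += entry.second;
    }
    return out;
}

// OP_FILTER: documents in both branches, with left's scores only.
Postings op_filter(const Postings& left, const Postings& right) {
    Postings out;
    for (const auto& entry : left)
        if (right.count(entry.first) != 0) out.insert(entry);
    return out;
}

// OP_XOR: documents in exactly one of the two branches.
Postings op_xor(const Postings& left, const Postings& right) {
    Postings out = op_and_not(left, right);
    for (const auto& entry : right)
        if (left.count(entry.first) == 0) out.insert(entry);
    return out;
}

// The left and right branches used in the diagrams.
Postings left_example() {
    Postings p;
    p[1] = 3.1; p[12] = 3.2; p[16] = 3.1; p[17] = 3.8;
    return p;
}

Postings right_example() {
    Postings p;
    p[1] = 4.2; p[8] = 4.1; p[16] = 4.5; p[18] = 4.7;
    return p;
}
```

<P>
For instance, <CODE>op_and_maybe(left_example(), right_example())</CODE> yields
documents 1, 12, 16 and 17, with the scores of documents 1 and 16 raised by
their right branch scores, matching the OP_AND_MAYBE diagram.
</P>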
582<P>
583A query can therefore be thought of as a process for generating an MSet from
584the terms at the leaf nodes of the query. Each leaf node gives rise to a
585posting list of documents with scores. Each higher level node gives rise to a
586similar list, and the root node of the tree contains the final set of documents
587with scores (or weights), which are candidates for going into the MSet. The
588MSet contains the documents which get the highest weights, and they are held in
589the MSet in weight order.
590</P>
591<P>
592It is important to realise that within Xapian the structure of a query is
593optimised for best performance, and it undergoes various transformations as the
594query progresses. The precise way in which the query is built up is therefore
595of little importance to Xapian - for example, you can AND together terms
596pair-by-pair, or combine several using AND on a std::vector of terms, and
597Xapian will build the same structure internally.
598</P>
599
600<H3>Using queries</H3>
601<H4>Probabilistic queries </H4>
602A plain probabilistic query is created by connecting terms together with
603Xapian::Query::OP_OR operators. For example,
604
<PRE>
    Xapian::Query query("regulation");
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("import"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("export"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("canned"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("fish"));
</PRE>
612
613This creates a probabilistic query with terms `regulation', `import', `export',
614`canned' and `fish'.
615<P>
616In fact this style of creation is so common that there is the shortcut
617construction:
618
619<PRE>
620    vector &lt;string&gt; terms;
621    terms.push_back("regulation");
622    terms.push_back("import");
623    terms.push_back("export");
624    terms.push_back("canned");
625    terms.push_back("fish");
626
627    Xapian::Query query(Xapian::Query::OP_OR, terms.begin(), terms.end());
628</PRE>
629<H4>Boolean queries</H4>
630Suppose now we have this Boolean query,
631<PRE>
632    ('EEC' - 'France') and ('1989' or '1991' or '1992') and 'Corporate Law'
633</PRE>
634
635This could be built up as bquery like this,
636
<PRE>
    // 'EEC' - 'France'
    Xapian::Query bquery1(Xapian::Query::OP_AND_NOT,
                          Xapian::Query("EEC"), Xapian::Query("France"));

    // '1989' or '1991' or '1992'
    Xapian::Query bquery2("1989");
    bquery2 = Xapian::Query(Xapian::Query::OP_OR, bquery2, Xapian::Query("1991"));
    bquery2 = Xapian::Query(Xapian::Query::OP_OR, bquery2, Xapian::Query("1992"));

    Xapian::Query bquery3("Corporate Law");

    // AND the three parts together
    Xapian::Query bquery(Xapian::Query::OP_AND, bquery1,
                         Xapian::Query(Xapian::Query::OP_AND, bquery2, bquery3));
</PRE>
648
649and this can be attached as a filter to <code>query</code> to run the
650probabilistic query with a Boolean filter,
651
652<PRE>
653    query = Xapian::Query(Xapian::Query::OP_FILTER, query, bquery);
654</PRE>
655
656If you want to run a pure boolean query, then set BoolWeight as the weighting
657scheme (by calling Enquire::set_weighting_scheme() with argument BoolWeight()).
658<H4>Plus and minus terms </H4>
659<P>
660A common requirement in search engine functionality is to run a
661probabilistic query where some terms are required to index all the
662retrieved documents (`+' terms), and others are required to
663index none of the retrieved documents (`-' terms). For example,
664
665<PRE>
666    regulation import export +canned +fish -japan
667</PRE>
668
669the corresponding query can be set up by,
670
<PRE>
    vector &lt;string&gt; plus_terms;
    vector &lt;string&gt; minus_terms;
    vector &lt;string&gt; normal_terms;

    plus_terms.push_back("canned");
    plus_terms.push_back("fish");

    minus_terms.push_back("japan");

    normal_terms.push_back("regulation");
    normal_terms.push_back("import");
    normal_terms.push_back("export");

    // Documents must match every plus term; normal terms only add weight.
    Xapian::Query query(Xapian::Query::OP_AND_MAYBE,
                  Xapian::Query(Xapian::Query::OP_AND, plus_terms.begin(), plus_terms.end()),
                  Xapian::Query(Xapian::Query::OP_OR, normal_terms.begin(), normal_terms.end()));

    // Exclude any document matching a minus term.
    query = Xapian::Query(Xapian::Query::OP_AND_NOT,
                    query,
                    Xapian::Query(Xapian::Query::OP_OR, minus_terms.begin(), minus_terms.end()));
</PRE>
693
694<H3>Undefined queries</H3>
695<P>
696Performing a match with an undefined query matches nothing, which is sometimes
697useful.  However an undefined query can't be used with operators to compose
698a query.
699</P>
700
701<H2>Retrieving the results of a query</H2>
702
703<P>
704The Xapian::Enquire class does not require that a method be called in order to
705perform the query.  Rather, you simply ask for the results of a query,
706and it will perform whatever calculations are necessary to provide the
707answer:
708</P>
709<PRE>
710Xapian::MSet <A HREF="apidoc/html/classXapian_1_1Enquire.html">Xapian::Enquire::get_mset</A>(Xapian::doccount first,
711                           Xapian::doccount maxitems,
712                           const Xapian::RSet * omrset = 0,
713                           const Xapian::MatchDecider * mdecider = 0) const
714<!-- FIXME check parameters -->
715</PRE>
716<P>
717When asking for the results, you must specify (in <CODE>first</CODE>) the
718first item in the result set to return, where the numbering starts at zero
719(so a value of
720zero corresponds to the first item returned being that with the highest
721score, and a value of 10 corresponds to the first 10 items being ignored,
722and the returned items starting at the eleventh).
723</P>
724<P>
725You must also specify (in <CODE>maxitems</CODE>) the maximum number of
726items to return.  Unless there are not enough matching items, precisely
727this number of items will be returned.
728If <CODE>maxitems</CODE> is zero, no items will be returned, but the usual
729statistics (such as the maximum possible weight which a document could be
730assigned by the query) will be calculated.  (See &quot;The Xapian::MSet&quot;
731below).
732</P>
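<P>
The meaning of <CODE>first</CODE> and <CODE>maxitems</CODE> can be sketched
with a plain list of scored documents.  This is a model of the paging
behaviour only, not of how Xapian computes results: <CODE>Match</CODE>,
<CODE>get_page()</CODE> and <CODE>example_matches()</CODE> are hypothetical
names, and the scores are taken from the OR example earlier in this document.
</P>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// (document id, weight) for one matching document.
typedef std::pair<unsigned, double> Match;

// Return at most maxitems matches in descending weight order, skipping
// the first `first` of them (numbering starts at zero).
std::vector<Match> get_page(std::vector<Match> matches,
                            std::size_t first, std::size_t maxitems) {
    std::stable_sort(matches.begin(), matches.end(),
                     [](const Match& a, const Match& b) {
                         return a.second > b.second;
                     });
    if (first >= matches.size()) return std::vector<Match>();
    // Fewer than maxitems are returned only if too few matches remain.
    std::size_t count = std::min(maxitems, matches.size() - first);
    return std::vector<Match>(matches.begin() + first,
                              matches.begin() + first + count);
}

// Six matching documents, in no particular order.
std::vector<Match> example_matches() {
    std::vector<Match> m;
    m.push_back(Match(1, 7.3));
    m.push_back(Match(8, 4.1));
    m.push_back(Match(12, 3.2));
    m.push_back(Match(16, 7.6));
    m.push_back(Match(17, 3.8));
    m.push_back(Match(18, 4.7));
    return m;
}
```

<P>
Here <CODE>get_page(example_matches(), 0, 2)</CODE> gives documents 16 and 1
(the two highest weights), while <CODE>get_page(example_matches(), 2, 10)</CODE>
gives the remaining four, since fewer than ten matches are left.
</P>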
733
734<H3>The Xapian::MSet</H3>
735<P>
736Query results are returned in an
737<A HREF="apidoc/html/classXapian_1_1MSet.html"><CODE>Xapian::MSet</CODE></A> object.
738The results can be accessed using a
739<A HREF="apidoc/html/classXapian_1_1MSetIterator.html"><CODE>Xapian::MSetIterator</CODE></A>
740which returns the matches in descending sorted order
741of relevance (so the most relevant document is first in the list).
742Each <CODE>Xapian::MSet</CODE> entry comprises a document id, and the weight
743calculated for that document.
744</P>
745<P>
746An <CODE>Xapian::MSet</CODE> also contains various information about the search
747result:
748<TABLE>
749<TR><TD VALIGN="top">
750<CODE>firstitem</CODE>
751</TD><TD>
752The index of the first item in the result which was put into the MSet.
753(Corresponding to <CODE>first</CODE> in
754<CODE>Xapian::Enquire::get_mset()</CODE>)
755</TD></TR><TR><TD VALIGN="top">
756<CODE>max_attained</CODE>
757</TD><TD VALIGN="top">
758The greatest weight which is attained in the full results of the search.
759</TD></TR><TR><TD VALIGN="top">
760<CODE>max_possible</CODE>
761</TD><TD VALIGN="top">
762The maximum possible weight in the MSet.
763</TD></TR><TR><TD VALIGN="top">
764<CODE>docs_considered</CODE>
765</TD><TD VALIGN="top">
The number of documents matching the query considered for the MSet.
This provides a lower bound on the number of documents in the database
which have a weight greater than zero.  Note that this value may change
if the search is recalculated with different values for <CODE>first</CODE> or
<CODE>maxitems</CODE>.
</TD></TR>
772</TABLE>
773</P>
<P>
See the <A HREF="apidoc/html/classXapian_1_1MSet.html">automatically extracted documentation</A>
for more details of these fields.
</P>
<P>
The <CODE>Xapian::MSet</CODE> also provides methods for converting the score
calculated for a given document into a percentage value, suitable for
displaying to a user.  This may be done using the
<A HREF="apidoc/html/classXapian_1_1MSet.html"><CODE>convert_to_percent()</CODE></A>
methods:
<PRE>
     int Xapian::MSet::convert_to_percent(const Xapian::MSetIterator &amp; item) const
     int Xapian::MSet::convert_to_percent(Xapian::weight wt) const
</PRE>
These methods return a value in the range 0 to 100, which will be
0 if and only if the item did not match the query at all.
</P>

<H3>Accessing a document</H3>
<P>
A document in the database is accessed via a
<A HREF="apidoc/html/classXapian_1_1Document.html"><CODE>Xapian::Document</CODE></A>
object.
This can be obtained by calling
<A HREF="apidoc/html/classXapian_1_1Database.html"><CODE>Xapian::Database::get_document()</CODE></A>.
The returned <CODE>Xapian::Document</CODE> is a reference-counted handle, so
copying is cheap.
</P>

<P>
Each document can have the following types of information associated with it:
</P>

<ul>
<li> document data - this is an arbitrary block of data accessed using
<A HREF="apidoc/html/classXapian_1_1Document.html"><CODE>Xapian::Document::get_data()</CODE></A>.
The contents of the document data can be whatever you want and in whatever
format.  Often it contains fields such as a URL or other external UID, a
document title, and an excerpt from the document text.  If you wish to
interoperate with Omega, it should contain name=value pairs, one per line
(recent versions of Omega also support one field value per line, and
can assign names to line numbers in the query template).

<li> terms and positional information - terms index the document (like index
entries in the back of a book); positional information records the word
offset into the document of each occurrence of a particular term.  This is
used to implement phrase searching and the NEAR operator.

<li> document values - these are arbitrary pieces of data which are stored
so they can be accessed rapidly during the match process (to allow sorting,
collapsing of duplicates, etc).  Each value is stored in a numbered slot
so you can have several for each document.  There's currently no length limit,
but you should keep them short for efficiency.
</ul>
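
<P>
To make the name=value layout of document data concrete, the following is a
minimal sketch of splitting such a block into named fields.  The function
<CODE>parse_fields()</CODE> is a hypothetical helper written for this example
(it is not part of Xapian or Omega), and it assumes the input is the string
returned by <CODE>Xapian::Document::get_data()</CODE> with one name=value pair
per line.
</P>

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Xapian or Omega): split document data
// laid out as one name=value pair per line into a map of fields.
std::map<std::string, std::string> parse_fields(const std::string& data) {
    std::map<std::string, std::string> fields;
    std::istringstream in(data);
    std::string line;
    while (std::getline(in, line)) {
        // Only the first '=' separates name from value, so values may
        // themselves contain '=' characters.
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip lines with no '='
        fields[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return fields;
}
```

<P>
Lines without an <CODE>=</CODE> are silently skipped here; an application
might prefer to report them instead.
</P>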

<P>
There's some overlap in what you can do with terms and with values.  A
simple boolean filter (e.g. document language) is definitely better
done using a term and OP_FILTER.
</P>

<P>
Using a value allows you to do things you can't do with terms, such as
"sort by price", or "show only the best match for each website".  You
can also perform filtering with a value which is more sophisticated
than can easily be achieved with terms, for example: find matches
with a price between $100 and $900.  Omega uses boolean terms to perform
date range filtering, but this might actually be better done using a
value (the code in Omega was written before values were added to
Xapian).
</P>
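
<P>
Since a value is an arbitrary piece of data, a numeric range test on it only
works if the stored form compares in the same order as the numbers it
represents.  The sketch below shows one possible encoding, assuming prices
are non-negative whole currency units and that values are compared as byte
strings.  The functions <CODE>encode_price()</CODE> and
<CODE>price_in_range()</CODE> are hypothetical and not part of Xapian; how
such a test is wired into the match process is outside the scope of this
sketch.
</P>

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Xapian): encode a price as a fixed-width,
// zero-padded decimal string so that comparing the strings gives the same
// order as comparing the numbers.  20 digits is wide enough for any 64-bit
// unsigned value.
std::string encode_price(unsigned long price) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%020lu", price);
    return std::string(buf);
}

// Test whether a stored value (produced by encode_price) lies in the
// inclusive range [lo, hi].
bool price_in_range(const std::string& value,
                    unsigned long lo, unsigned long hi) {
    return encode_price(lo) <= value && value <= encode_price(hi);
}
```

<P>
Without the padding, "1000" would sort before "900" as a string, so a price
of $1000 would wrongly appear to fall between $100 and $900.
</P>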

<H2>Specifying a relevance set</H2>
<P>
Xapian supports the idea of relevance feedback: that is, of allowing the user
to mark documents as being relevant to the search, and using this information
to modify the search.  This is supported by means of relevance sets, which
are simply sets of document ids which are marked as relevant.  These
are held in <A HREF="apidoc/html/classXapian_1_1RSet.html"><CODE>Xapian::RSet</CODE></A> objects,
one of which may optionally be supplied to Xapian in the
<CODE>omrset</CODE> parameter when calling
<CODE>Xapian::Enquire::get_mset()</CODE>.
</P>

<H3>Match options</H3>

<P>
There are various additional options which may be specified when
performing the query.  These are specified by calling
<A HREF="apidoc/html/classXapian_1_1Enquire.html">various methods
of the <CODE>Xapian::Enquire</CODE> object</A>.
The options are as follows.
</P>
<TABLE>
<TR><TD VALIGN="top">