Revision 10165, 20.6 kB (checked in by olly, 10 months ago)

Backport change from trunk:
docs/quickstart.html: Remove information covered by INSTALL since
there's no good reason to repeat it and two copies just risks one
getting out of date (as has happened here!)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Xapian: Quickstart</TITLE>
</HEAD>
<BODY BGCOLOR="white">

<H1>Quickstart</H1>

<P>
This document contains a quick introduction to the basic concepts, and then
a walk-through development of a simple application using the Xapian
library, together with commentary on how the application could be taken
further.  It deliberately avoids going into a lot of detail - see the
<a href="index.html">rest of the documentation</a> for more detail.
</P>

<HR>
<H2>Requirements</H2>

<P>
Before following the steps outlined in this document, you will need to have
the Xapian library installed on your system.
For instructions on obtaining and installing Xapian, read the
<A HREF="install.html">Installation</A> document.
</P>

<HR>
<H2>Databases</H2>

<P>
An information retrieval system using Xapian typically has two parts.  The
first part is the <EM>indexer</EM>, which takes documents in various
formats, processes them so that they can be efficiently searched, and
stores the processed documents in an appropriate data structure (the
<EM>database</EM>).  The second part is the <EM>searcher</EM>, which takes
queries and reads the database to return a list of the documents relevant
to each query.
</P>
<P>
The database is the data structure which ties the indexer and searcher
together, and is fundamental to the retrieval process.  Given how
fundamental it is, it is unsurprising that different applications put
different demands on the database.  For example, some applications may be
happy to deal with searching a static collection of data, but need to do
this extremely fast (for example, a web search engine which builds new
databases from scratch nightly or even weekly).  Other applications may
require that new data can be added to the system incrementally, but don't
require extremely high performance searching (perhaps an email system,
which is only being searched occasionally).  There are many other
constraints which may be placed on an information retrieval system: for
example, it may be required to have small database sizes, even at the
expense of getting poorer results from the system.
</P>
<P>
To provide the required flexibility, Xapian has the ability to use one of
many available database <EM>backends</EM>, each of which satisfies a
different set of constraints, and stores its data in a different way.

Currently, every backend must be compiled into the system, and the one to
use is selected at runtime.  The ability to dynamically load a module for
each of these backends is likely to be added in future, and would require
little design modification.
</P>
<!--
<P>
If you are in a real hurry, you could probably skip the rest of this
section, but it is helpful to understand roughly what information Xapian
stores in a database and how it is structured, and the following
subsections detail this.
</P>

<H3>The contents of a database</H3>

<P>
FIXME: to be written.
Documents, terms, data, keys.
What can be accessed fast, what can't.
How each piece of data might be stored.
</P>

<H3><A NAME="flint_databases">Flint databases</A></H3>

<P>
FIXME: to be written.
</P>
-->

<HR>
<H2><A NAME="indexer">An example indexer</A></H2>

<P>
We now present sample code for an indexer. This is deliberately
simplified to make it easier to follow. You can also read it in <A
HREF="quickstartindex.cc.html">an HTML formatted version</A>.
</P>
<P>
The &quot;indexer&quot; presented here is simply a small program which
takes a path to a database and a set of parameters defining a document on
the command line, and stores that document as a new entry in the database.
</P>
<H3>Include header files</H3>
<P>
The first requirement in any program using the Xapian library is to
include the Xapian header file, &quot;<CODE>xapian.h</CODE>&quot;:
<PRE>    #include &lt;xapian.h&gt;</PRE>
</P>
<P>
We're going to use C++ iostreams for output, so we need to include
the <CODE>iostream</CODE> header, and we'll also import everything
from namespace <CODE>std</CODE> for convenience:
<PRE>    #include &lt;iostream&gt;
    using namespace std;</PRE>
</P>
<P>
Our example only has a single function, <CODE>main()</CODE>, so next we
define that:
<PRE>    int main(int argc, char **argv)</PRE>
</P>
<H3>Options parsing</H3>
<P>
For this example we do very simple options parsing.  We are going to
use Xapian's core functionality of searching for specific terms in the
database, and we are not going to use any of the extra facilities, such as
the keys which may be associated with each document.  We are also going to
store a simple string as the data associated with each document.
</P><P>
Thus, our command line syntax is:
<UL><LI>
<B>Parameter 1</B> - the (possibly relative) path to the database.
</LI><LI>
<B>Parameter 2</B> - the string to be stored as the document data.
</LI><LI>
<B>Parameters 3 onward</B> - the terms to be stored in the database.  The
terms will be assumed to occur at successive positions in the document.
</LI></UL>
</P><P>
The validity of a command line can therefore be checked very simply by
ensuring that there are at least 3 parameters (<CODE>argc</CODE> also counts
the program name, so it must be at least 4):
<PRE>
    if (argc &lt; 4) {
        cout &lt;&lt; "usage: " &lt;&lt; argv[0] &lt;&lt;
                " &lt;path to database&gt; &lt;document data&gt; &lt;document terms&gt;" &lt;&lt; endl;
        exit(1);
    }
</PRE>
</P>

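<P>
As a rough sketch, the same check can be pulled out into a helper which also
collects the three kinds of parameter.  The names <CODE>IndexArgs</CODE> and
<CODE>parse_index_args</CODE> are hypothetical: they are not part of Xapian
or of the quickstart sources, and only the standard library is used.
</P>

```cpp
#include <string>
#include <vector>

// Hypothetical holder for the indexer's command line parameters.
struct IndexArgs {
    std::string db_path;             // parameter 1
    std::string doc_data;            // parameter 2
    std::vector<std::string> terms;  // parameters 3 onward
};

// Fills `out` and returns true if at least 3 parameters were supplied.
// argc also counts the program name in argv[0], hence the comparison with 4.
bool parse_index_args(int argc, const char* const* argv, IndexArgs& out) {
    if (argc < 4) return false;
    out.db_path = argv[1];
    out.doc_data = argv[2];
    out.terms.assign(argv + 3, argv + argc);
    return true;
}
```

<P>
From <CODE>main()</CODE> this could be called as
<CODE>parse_index_args(argc, argv, args)</CODE>, printing the usage message
when it returns false.
</P>
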
<H3>Catching exceptions</H3>
<P>
When an error occurs in Xapian it is reported by means of the C++ exception
mechanism.  All errors in Xapian are derived classes of
<CODE>Xapian::Error</CODE>, so simple error handling can be performed by
enclosing all the code in a try-catch block to catch any
<CODE>Xapian::Error</CODE> exceptions.  A (hopefully) helpful message can be
extracted from the <CODE>Xapian::Error</CODE> object by calling its
<CODE>get_msg()</CODE> method, which returns a human readable string.
</P>
<P>
Note that all calls to the Xapian library should be performed inside a
try-catch block, since otherwise errors will result in uncaught exceptions;
this usually results in the execution aborting.
</P>
<P>
Note also that <CODE>Xapian::Error</CODE> is only a base class (the exception
actually thrown is always of a derived class), and thus can't be copied:
you must therefore catch exceptions by reference, as in the following example
code:
</P>
<PRE>
    try {
        <B>[code which accesses Xapian]</B>
    } catch (const Xapian::Error &amp; error) {
        cout &lt;&lt; "Exception: " &lt;&lt; error.get_msg() &lt;&lt; endl;
    }
</PRE>

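<P>
The catch-by-reference pattern can be tried without Xapian installed.  The
sketch below uses stand-in classes, <CODE>BaseError</CODE> and
<CODE>OpeningError</CODE>, which are hypothetical and only loosely modelled
on the description above; they are not the real Xapian classes.  Because the
stand-in base class has a pure virtual method, it can only be caught by
reference.
</P>

```cpp
#include <string>

// Stand-in base class (hypothetical, not Xapian::Error itself).  It can
// only exist as part of a derived class, so it has to be caught by reference.
class BaseError {
    std::string msg_;
  protected:
    explicit BaseError(const std::string& msg) : msg_(msg) {}
  public:
    virtual ~BaseError() {}
    const std::string& get_msg() const { return msg_; }
    virtual const char* get_type() const = 0;
};

// Stand-in derived class representing one specific kind of error.
class OpeningError : public BaseError {
  public:
    explicit OpeningError(const std::string& msg) : BaseError(msg) {}
    const char* get_type() const override { return "OpeningError"; }
};

// Throws a derived exception and catches it through a reference to the
// base class, so the derived type's behaviour is still available.
std::string describe_failure() {
    try {
        throw OpeningError("cannot open database");
    } catch (const BaseError& error) {
        return std::string(error.get_type()) + ": " + error.get_msg();
    }
}
```
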
<H3>Opening the database</H3>

<P>
In Xapian, a database is opened for writing by creating a
Xapian::WritableDatabase object.
</P>
<P>
If you pass Xapian::DB_CREATE_OR_OPEN and there isn't an existing database
in the specified directory, Xapian will try to create a new empty database
there.  If there is already a database in the specified directory, it will be
opened.
</P>
<P>
If an error occurs when trying to open a database, or to create a new database,
an exception, usually of type <CODE>Xapian::DatabaseOpeningError</CODE> or
<CODE>Xapian::DatabaseCreateError</CODE>, will be thrown.
</P>
<P>
The code to open a database for writing is, then:
</P>

<PRE>
    Xapian::WritableDatabase database(argv[1], Xapian::DB_CREATE_OR_OPEN);
</PRE>

<H3>Preparing the new document</H3>

<P>
Now that we have the database open, we need to prepare a document to
put in it.  This is done by creating a Xapian::Document object, filling
this with data, and then giving it to the database.
</P>

<P>
The first step, then, is to create the document:
</P>
<PRE>
    Xapian::Document newdocument;
</PRE>

<P>
Each <code>Xapian::Document</code> has a "cargo" known as the <i>document data</i>.
This data is opaque to Xapian - the meaning of it is entirely user-defined.
Typically it contains information to allow results to be displayed by the
application, for example a URL for the indexed document and
some text which is to be displayed when returning the document as a search
result.
</P>
<P>
For our example, we shall simply store the second parameter given on the
command line in the data field:
</P>
<PRE>
    newdocument.set_data(string(argv[2]));
</PRE>

<P>
The next step is to put the terms which are to be used when searching
for the document into the Xapian::Document object.
</P>
<P>
We shall use the <CODE>add_posting()</CODE> method, which adds an
occurrence of a term to the document.  The first parameter is the
&quot;<EM>termname</EM>&quot;, which is a string defining the term.  This
string can be anything, as long as the same string is always used to refer
to the same term.  The string will often be the (possibly stemmed) text
of the term, but might be in a compressed, or even hashed, form.
In general, there is no upper limit to the length of a termname, but some
database backends may impose their own limits.
</P>
<P>
The second parameter is the position at which the term occurs within the
document.  These positions start at 1.  This information is used for
some search features such as phrase matching or passage retrieval, but
is not essential to the search.
</P>

<P>
We add postings for terms with the termname given as each of the remaining
command line parameters:
</P>
<PRE>
    for (int i = 3; i &lt; argc; ++i) {
        newdocument.add_posting(argv[i], i - 2);
    }
</PRE>

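<P>
To see what the position argument buys us, here is a toy model of the
positional information for one document.  It only illustrates the idea:
<CODE>PositionMap</CODE>, <CODE>positions_from_args</CODE> and
<CODE>occurs_adjacent</CODE> are hypothetical names, and nothing here
reflects how Xapian actually stores postings.
</P>

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for the positional information of one document:
// maps each termname to the positions at which it occurs.
typedef std::map<std::string, std::vector<unsigned> > PositionMap;

// Uses the same numbering as the indexer loop: parameter i (for i >= 3)
// becomes a posting at position i - 2, so the first term is at position 1.
PositionMap positions_from_args(int argc, const char* const* argv) {
    PositionMap positions;
    for (int i = 3; i < argc; ++i) {
        positions[argv[i]].push_back(static_cast<unsigned>(i - 2));
    }
    return positions;
}

// True if `second` occurs immediately after `first` somewhere in the
// document, which is the kind of test a phrase search needs.
bool occurs_adjacent(const PositionMap& positions,
                     const std::string& first, const std::string& second) {
    PositionMap::const_iterator a = positions.find(first);
    PositionMap::const_iterator b = positions.find(second);
    if (a == positions.end() || b == positions.end()) return false;
    for (unsigned p : a->second) {
        if (std::find(b->second.begin(), b->second.end(), p + 1) != b->second.end())
            return true;
    }
    return false;
}
```

<P>
Without the positions, the model could still say which terms a document
contains, but not whether two of them are next to each other.
</P>
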
<H3>Adding the document to the database</H3>

<P>
Finally, we can add the document to the database.  This simply involves
calling <CODE>Xapian::WritableDatabase::add_document()</CODE>, and passing it
the <CODE>Xapian::Document</CODE> object:
</P>
<PRE>
    database.add_document(newdocument);
</PRE>

<P>
The operation of adding a document is atomic: either the document will be
added, or an exception will be thrown and the document will not be in the
database.
</P>
<P>
<CODE>add_document()</CODE> returns a value of type <CODE>Xapian::docid</CODE>.
This is the document ID of the newly added document, which is simply a
handle which can be used to access the document in future.
</P>
<P>
Note that this use of <CODE>add_document()</CODE> is actually fairly
inefficient: if we had a large database, it would be desirable to group
as many document additions together as possible, by encapsulating
them within a session.  For details of this, and of the transaction
facility for performing sets of database modifications atomically, see
the <A HREF="overview.html">API Overview</A>.
</P>

<HR>
<H2><A NAME="searcher">An example searcher</A></H2>

<P>
Now we show the code for a simple searcher, which will search the
database built by the indexer above. Again, you can read <A
HREF="quickstartsearch.cc.html">an HTML formatted version</A>.
</P>
<P>
The &quot;searcher&quot; presented here is, like the &quot;indexer&quot;,
simply a small command line driven program.  It takes a path to a database
and some search terms, performs a probabilistic search for documents
represented by those terms and displays a ranked list of matching documents.
</P>

<H3>Setting up</H3>

<P>
Just like &quot;quickstartindex&quot;, we have a single-function example.
So we include the Xapian header file, together with the standard headers
for the iostream, string and vector facilities used below, and begin:
</P>
<PRE>
    #include &lt;xapian.h&gt;
    #include &lt;iostream&gt;
    #include &lt;string&gt;
    #include &lt;vector&gt;
    using namespace std;

    int main(int argc, char **argv)
    {
</PRE>

<H3>Options parsing</H3>
<P>
Again, we are going to use no special options, and have a very simple
command line syntax:
<UL><LI>
<B>Parameter 1</B> - the (possibly relative) path to the database.
</LI><LI>
<B>Parameters 2 onward</B> - the terms to be searched for in the database.
</LI></UL>
</P><P>
The validity of a command line can therefore be checked very simply by
ensuring that there are at least 2 parameters (so <CODE>argc</CODE>, which
also counts the program name, must be at least 3):
</P>
<PRE>
    if (argc &lt; 3) {
        cout &lt;&lt; "usage: " &lt;&lt; argv[0] &lt;&lt;
                " &lt;path to database&gt; &lt;search terms&gt;" &lt;&lt; endl;
        exit(1);
    }
</PRE>

<H3>Catching exceptions</H3>
<P>
Again, this is performed just as it was for the simple indexer.
</P>
<PRE>
    try {
        <B>[code which accesses Xapian]</B>
    } catch (const Xapian::Error &amp; error) {
        cout &lt;&lt; "Exception: " &lt;&lt; error.get_msg() &lt;&lt; endl;
    }
</PRE>

<H3>Specifying the databases</H3>
<P>
Xapian has the ability to search over many databases simultaneously,
possibly even with the databases distributed across a network of machines.
Each database can be in its own format, so, for example, we might have a
system searching across two remote databases and a flint database.
</P>
<P>
To open a single database, we create a Xapian::Database object, passing
the path to the database we want to open:
</P>
<PRE>
    Xapian::Database db(argv[1]);
</PRE>
<P>
You can also search multiple databases by adding them together using
<CODE>Xapian::Database::add_database</CODE> (in this fragment, unlike in our
example program, the second parameter is also taken to be a database path):
</P>
<PRE>
    Xapian::Database databases;
    databases.add_database(Xapian::Database(argv[1]));
    databases.add_database(Xapian::Database(argv[2]));
</PRE>

<H3>Starting an enquire session</H3>
<P>
All searches across databases by Xapian are performed within the context of
an &quot;<EM>Enquire</EM>&quot; session.  This session is represented by a
<CODE>Xapian::Enquire</CODE> object, and operates on a specified collection of
databases.  To change the database collection, it is necessary to open a
new enquire session, by creating a new <CODE>Xapian::Enquire</CODE> object.
Our example searches the single database opened above:
<PRE>
    Xapian::Enquire enquire(db);
</PRE>
</P>
<P>
An enquire session is also the context within which all other database
reading operations, such as query expansion and reading the data associated
with a document, are performed.
</P>

<H3>Preparing to search</H3>

<P>
We are going to use all command line parameters from the second onward
as terms to search for in the database.  For convenience, we shall store
them in an STL vector.  This is probably the point at which we would want
to apply a stemming algorithm, or any other desired normalisation and
conversion operation, to the terms.
<PRE>
    vector&lt;string&gt; queryterms;
    for (int optpos = 2; optpos &lt; argc; optpos++) {
        queryterms.push_back(argv[optpos]);
    }
</PRE>
</P>

<P>
Queries are represented within Xapian by <CODE>Xapian::Query</CODE> objects, so
the next step is to construct one from our query terms.
Conveniently there is a constructor which will take our vector
of terms and create a <CODE>Xapian::Query</CODE> object from it.
<PRE>
    Xapian::Query query(Xapian::Query::OP_OR, queryterms.begin(), queryterms.end());
</PRE>
</P>

<P>
You will notice that we had to specify an operation to be performed on
the terms (the <CODE>Xapian::Query::OP_OR</CODE> parameter).
Queries in Xapian are actually
fairly complex things: a full range of boolean operations can be applied to
queries to restrict the result set, and probabilistic weightings are then
applied to order the results by relevance.  By specifying the OR operation,
we are not performing any boolean restriction, and are performing a
traditional pure probabilistic search.
</P>

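<P>
The difference between combining terms with OR and applying a boolean
restriction can be seen in a toy model.  This is only a sketch under heavy
assumptions: <CODE>ToyIndex</CODE>, <CODE>match_any</CODE> and
<CODE>match_all</CODE> are hypothetical names, the index is a plain map from
term to document IDs, and no weighting or ranking is modelled at all.
</P>

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy index: each term maps to the IDs of the documents containing it.
typedef std::map<std::string, std::set<unsigned> > ToyIndex;

// OR-style matching: a document matches if it contains any of the terms,
// so no document which contains at least one term is excluded.
std::set<unsigned> match_any(const ToyIndex& index,
                             const std::vector<std::string>& terms) {
    std::set<unsigned> result;
    for (const std::string& term : terms) {
        ToyIndex::const_iterator it = index.find(term);
        if (it != index.end()) result.insert(it->second.begin(), it->second.end());
    }
    return result;
}

// A boolean restriction, for contrast: a document matches only if it
// contains every one of the terms.
std::set<unsigned> match_all(const ToyIndex& index,
                             const std::vector<std::string>& terms) {
    std::set<unsigned> result;
    bool first = true;
    for (const std::string& term : terms) {
        ToyIndex::const_iterator it = index.find(term);
        std::set<unsigned> docs;
        if (it != index.end()) docs = it->second;
        if (first) {
            result = docs;
            first = false;
        } else {
            std::set<unsigned> kept;
            std::set_intersection(result.begin(), result.end(),
                                  docs.begin(), docs.end(),
                                  std::inserter(kept, kept.begin()));
            result.swap(kept);
        }
    }
    return result;
}
```

<P>
In the real library the documents matched by an OR query are additionally
ordered by their probabilistic weights, which this model leaves out.
</P>
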
<P>
We now print a message out to confirm to the user what the query being
performed is.  This is done with the <CODE>Xapian::Query::get_description()</CODE>
method, which is mainly included for debugging purposes, and displays
a string representation of the query.
</P>
<PRE>
    cout &lt;&lt; "Performing query `" &lt;&lt;
         query.get_description() &lt;&lt; "'" &lt;&lt; endl;
</PRE>

<H3>Performing the search</H3>
<P>
Now, we are ready to perform the search.  The first step of this is to
give the query object to the enquire session.  Note that the query is
copied by this operation, and that changing the Xapian::Query object after
setting the query with it has no effect.
</P>
<PRE>
    enquire.set_query(query);
</PRE>

<P>
Next, we ask for the results of the search.  There is no need to tell
Xapian to perform the search: it will do this automatically.  We use
the <CODE>get_mset()</CODE> method to get the results, which are returned
in a <CODE>Xapian::MSet</CODE> object.  (MSet stands for Match Set.)
</P>
<P>
<CODE>get_mset()</CODE> can take many parameters, such as a set of
relevant documents to use, and various options to modify the search,
but we give it the minimum: the first document to return (starting
at 0 for the top ranked document), and the maximum number of documents
to return (we specify 10 here):
<PRE>
    Xapian::MSet matches = enquire.get_mset(0, 10);
</PRE>
</P>

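<P>
The meaning of the two parameters can be modelled as taking a window out of
an already ranked list, which is also how a second page of results would be
requested (first document 10, maximum 10).  The function
<CODE>result_window</CODE> below is a hypothetical illustration of the
parameters only; it says nothing about how Xapian computes the match set
internally.
</P>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns at most `maxitems` entries of `ranked`, starting at index `first`
// (0 is the top ranked entry).  Fewer entries, or none, are returned when
// the list runs out.
std::vector<unsigned> result_window(const std::vector<unsigned>& ranked,
                                    std::size_t first, std::size_t maxitems) {
    if (first >= ranked.size()) return std::vector<unsigned>();
    std::size_t last = first + std::min(maxitems, ranked.size() - first);
    return std::vector<unsigned>(ranked.begin() + first, ranked.begin() + last);
}
```
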
<H3>Displaying the results of the search</H3>
<P>
Finally, we display the results of the search.  The results are stored
in the <CODE>Xapian::MSet</CODE> object, which provides the features required
to be an STL-compatible container, so first we display how many items are in
the MSet:
<PRE>
    cout &lt;&lt; matches.size() &lt;&lt; " results found" &lt;&lt; endl;
</PRE>
</P>

<P>
Now we display some information about each of the items in the
<CODE>Xapian::MSet</CODE>.  We access these items using a
<CODE>Xapian::MSetIterator</CODE>:
<UL><LI>
First, we display the document ID, accessed by <CODE>*i</CODE>.
This is not usually very useful information to give to users, but it is
at least a unique handle on each document.
</LI><LI>
Next, we display a &quot;percentage&quot; score for the document.  Readers
familiar with Information Retrieval will not be surprised to hear that this
is not really a percentage: it is just a value from 0 to 100, such that a
more relevant document has a higher value.  We get this using
<CODE>i.get_percent()</CODE>.
</LI><LI>
Last, we display the data associated with each returned document, which
was specified by the user at database generation time.  To do this, we
first use <CODE>i.get_document()</CODE> to get a <CODE>Xapian::Document</CODE>
object representing the returned document; then we use the
<CODE>get_data()</CODE> method of this object to get
access to the data stored in this document.
</LI></UL>
<PRE>
    Xapian::MSetIterator i;
    for (i = matches.begin(); i != matches.end(); ++i) {
        cout &lt;&lt; "Document ID " &lt;&lt; *i &lt;&lt; "\t";
        cout &lt;&lt; i.get_percent() &lt;&lt; "% ";
        Xapian::Document doc = i.get_document();
        cout &lt;&lt; "[" &lt;&lt; doc.get_data() &lt;&lt; "]" &lt;&lt; endl;
    }
</PRE>
</P>

<HR>
<H2>Compiling</H2>

Now that we have the code written, all we need to do is compile it!

<H3>Finding the Xapian library</H3>

<P>
A small utility, &quot;xapian-config&quot;, is installed along with Xapian
to assist you in finding the installed Xapian library, and in generating
the flags to pass to the compiler and linker when building against it.
</P><P>
After a successful installation, this utility should be in your path, so
you can simply run
<BLOCKQUOTE><CODE>xapian-config --cxxflags</CODE></BLOCKQUOTE>
to determine the flags to pass to the compiler, and
<BLOCKQUOTE><CODE>xapian-config --libs</CODE></BLOCKQUOTE>
to determine the flags to pass to the linker.

These flags are returned on the utility's standard output (so you could use
backtick notation to include them on your command line).
</P><P>
If your project uses the GNU autoconf tool, you may also use the
<CODE>XO_LIB_XAPIAN</CODE> macro, which is included as part of Xapian,
and will check for an installation of Xapian and set (and
<CODE>AC_SUBST</CODE>) the <CODE>XAPIAN_CXXFLAGS</CODE> and
<CODE>XAPIAN_LIBS</CODE> variables to
be the flags to pass to the compiler and linker, respectively.
</P><P>
If you don't use GNU autoconf, don't worry about this.
</P>

<H3>Compiling the quickstart examples</H3>
Once you know the compilation flags, compilation is a simple matter of
invoking the compiler!  For our example, we could compile the two
utilities (quickstartindex and quickstartsearch) with the commands:
<PRE>
c++ quickstartindex.cc `xapian-config --libs --cxxflags` -o quickstartindex
c++ quickstartsearch.cc `xapian-config --libs --cxxflags` -o quickstartsearch
</PRE>

<HR>
<H2>Running the examples</H2>

<P>
Once we have compiled the above examples, we can build up a simple
database as follows.  Note that we must first create a directory for
the database files to live in; although Xapian will create new empty
database files if they do not yet exist, it will not create a new
directory for them.
<PRE>
$ mkdir proverbs
$ ./quickstartindex proverbs \
&gt; "people who live in glass houses should not throw stones" \
&gt; people live glass house stone
$ ./quickstartindex proverbs \
&gt; "Don't look a gift horse in the mouth" \
&gt; look gift horse mouth
</PRE>
</P>

<P>
Now, we should have a database with a couple of documents in it.  Looking
in the database directory, you should see something like:
<PRE>
$ ls proverbs/
<i>[some files]</i>
</PRE>
</P>
<P>
Given the small amount of data in the database, you may be concerned that
the total size of these files is somewhat over 50k.  Be reassured that the
database is block structured, here consisting of largely empty
blocks, and will behave much better for large databases.
</P>

<P>
We can now perform searches over the database using the quickstartsearch
program.
<PRE>
$ ./quickstartsearch proverbs look
Performing query `look'
1 results found
Document ID 2   50% [Don't look a gift horse in the mouth]
</PRE>
</P>

<!-- FOOTER $Author$ $Date$ $Id$ -->
</BODY>
</HTML>