| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>Xapian: Quickstart</TITLE> |
|---|
| 5 | </HEAD> |
|---|
| 6 | <BODY BGCOLOR="white"> |
|---|
| 7 | |
|---|
| 8 | <H1>Quickstart</H1> |
|---|
| 9 | |
|---|
| 10 | <P> |
|---|
| 11 | This document contains a quick introduction to the basic concepts, and then |
|---|
| 12 | a walk-through development of a simple application using the Xapian |
|---|
| 13 | library, together with commentary on how the application could be taken |
|---|
| 14 | further. It deliberately avoids going into a lot of detail - see the |
|---|
| 15 | <a href="index.html">rest of the documentation</a> for more detail. |
|---|
| 16 | </P> |
|---|
| 17 | |
|---|
| 18 | <HR> |
|---|
| 19 | <H2>Requirements</H2> |
|---|
| 20 | |
|---|
| 21 | <P> |
|---|
| 22 | Before following the steps outlined in this document, you will need to have |
|---|
| 23 | the Xapian library installed on your system. |
|---|
| 24 | For instructions on obtaining and installing Xapian, read the |
|---|
| 25 | <A HREF="install.html">Installation</A> document. |
|---|
| 26 | </P> |
|---|
| 27 | |
|---|
| 28 | <HR> |
|---|
| 29 | <H2>Databases</H2> |
|---|
| 30 | |
|---|
| 31 | <P> |
|---|
| 32 | An information retrieval system using Xapian typically has two parts. The |
|---|
| 33 | first part is the <EM>indexer</EM>, which takes documents in various |
|---|
| 34 | formats, processes them so that they can be efficiently searched, and |
|---|
| 35 | stores the processed documents in an appropriate data structure (the |
|---|
| 36 | <EM>database</EM>). The second part is the <EM>searcher</EM>, which takes |
|---|
| 37 | queries and reads the database to return a list of the documents relevant |
|---|
| 38 | to each query. |
|---|
| 39 | </P> |
|---|
| 40 | <P> |
|---|
| 41 | The database is the data structure which ties the indexer and searcher |
|---|
| 42 | together, and is fundamental to the retrieval process. Given how |
|---|
| 43 | fundamental it is, it is unsurprising that different applications put |
|---|
| 44 | different demands on the database. For example, some applications may be |
|---|
| 45 | happy to deal with searching a static collection of data, but need to do |
|---|
| 46 | this extremely fast (for example, a web search engine which builds new |
|---|
| 47 | databases from scratch nightly or even weekly). Other applications may |
|---|
| 48 | require that new data can be added to the system incrementally, but don't |
|---|
| 49 | require extremely high performance searching (perhaps an email system, |
|---|
| 50 | which is only being searched occasionally). There are many other |
|---|
| 51 | constraints which may be placed on an information retrieval system: for |
|---|
| 52 | example, it may be required to have small database sizes, even at the |
|---|
| 53 | expense of getting poorer results from the system. |
|---|
| 54 | </P> |
|---|
| 55 | <P> |
|---|
| 56 | To provide the required flexibility, Xapian has the ability to use one of |
|---|
| 57 | many available database <EM>backends</EM>, each of which satisfies a |
|---|
| 58 | different set of constraints, and stores its data in a different way. |
|---|
| 59 | |
|---|
| 60 | Currently, these must be compiled into the whole system, and selected at |
|---|
| 61 | runtime, but the ability to dynamically load modules for each of these |
|---|
| 62 | backends is likely to be added in future, and would require little design |
|---|
| 63 | modification. |
|---|
| 64 | </P> |
|---|
| 65 | <!-- |
|---|
| 66 | <P> |
|---|
| 67 | If you are in a real hurry, you could probably skip the rest of this |
|---|
| 68 | section, but it is helpful to understand roughly what information Xapian |
|---|
| 69 | stores in a database and how it is structured, and the following |
|---|
| 70 | subsections detail this. |
|---|
| 71 | </P> |
|---|
| 72 | |
|---|
| 73 | <H3>The contents of a database</H3> |
|---|
| 74 | |
|---|
| 75 | <P> |
|---|
| 76 | FIXME: to be written. |
|---|
| 77 | Documents, terms, data, keys. |
|---|
| 78 | What can be accessed fast, what can't. |
|---|
| 79 | How each piece of data might be stored. |
|---|
| 80 | </P> |
|---|
| 81 | |
|---|
| 82 | <H3><A NAME="flint_databases">Flint databases</A></H3> |
|---|
| 83 | |
|---|
| 84 | <P> |
|---|
| 85 | FIXME: to be written. |
|---|
| 86 | </P> |
|---|
| 87 | --> |
|---|
| 88 | |
|---|
| 89 | <HR> |
|---|
| 90 | <H2><A NAME="indexer">An example indexer</A></H2> |
|---|
| 91 | |
|---|
| 92 | <P> |
|---|
| 93 | We now present sample code for an indexer. This is deliberately |
|---|
| 94 | simplified to make it easier to follow. You can also read it in <A |
|---|
| 95 | HREF="quickstartindex.cc.html">an HTML formatted version</A>. |
|---|
| 96 | </P> |
|---|
| 97 | <P> |
|---|
| 98 | The "indexer" presented here is simply a small program which |
|---|
| 99 | takes a path to a database and a set of parameters defining a document on |
|---|
| 100 | the command line, and stores that document as a new entry in the database. |
|---|
| 101 | </P> |
|---|
| 102 | <H3>Include header files</H3> |
|---|
| 103 | <P> |
|---|
| 104 | The first requirement in any program using the Xapian library is to |
|---|
| 105 | include the Xapian header file, "<CODE>xapian.h</CODE>": |
|---|
| 106 | <PRE> #include &lt;xapian.h&gt;</PRE> |
|---|
| 107 | </P> |
|---|
| 108 | <P> |
|---|
| 109 | We're going to use C++ iostreams for output, so we need to include |
|---|
| 110 | the <CODE>iostream</CODE> header, and we'll also import everything |
|---|
| 111 | from namespace <CODE>std</CODE> for convenience: |
|---|
| 112 | <PRE> #include &lt;iostream&gt; |
|---|
| 113 | using namespace std;</PRE> |
|---|
| 114 | </P> |
|---|
| 115 | <P> |
|---|
| 116 | Our example only has a single function, <CODE>main()</CODE>, so next we |
|---|
| 117 | define that: |
|---|
| 118 | <PRE> int main(int argc, char **argv)</PRE> |
|---|
| 119 | </P> |
|---|
| 120 | <H3>Options parsing</H3> |
|---|
| 121 | <P> |
|---|
| 122 | For this example we do very simple options parsing. We are going to |
|---|
| 123 | use Xapian's core functionality of searching for specific terms in the |
|---|
| 124 | database, and we are not going to use any of the extra facilities, such as |
|---|
| 125 | the keys which may be associated with each document. We are also going to |
|---|
| 126 | store a simple string as the data associated with each document. |
|---|
| 127 | </P><P> |
|---|
| 128 | Thus, our command line syntax is: |
|---|
| 129 | <UL><LI> |
|---|
| 130 | <B>Parameter 1</B> - the (possibly relative) path to the database. |
|---|
| 131 | </LI><LI> |
|---|
| 132 | <B>Parameter 2</B> - the string to be stored as the document data. |
|---|
| 133 | </LI><LI> |
|---|
| 134 | <B>Parameters 3 onward</B> - the terms to be stored in the database. The |
|---|
| 135 | terms will be assumed to occur at successive positions in the document. |
|---|
| 136 | </LI></UL> |
|---|
| 137 | </P><P> |
|---|
| 138 | The validity of a command line can therefore be checked very simply by |
|---|
| 139 | ensuring that there are at least 3 parameters: |
|---|
| 140 | <PRE> |
|---|
| 141 | if (argc &lt; 4) { |
|---|
| 142 | cout &lt;&lt; "usage: " &lt;&lt; argv[0] &lt;&lt; |
|---|
| 143 | " &lt;path to database&gt; &lt;document data&gt; &lt;document terms&gt;" &lt;&lt; endl; |
|---|
| 144 | exit(1); |
|---|
| 145 | } |
|---|
| 146 | </PRE> |
|---|
| 147 | </P> |
|---|
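<P>
The command line layout described above can be sketched as a small helper that separates the three kinds of parameter. This is only an illustration: <CODE>IndexerArgs</CODE> and <CODE>parse_indexer_args()</CODE> are hypothetical names which are not part of Xapian or of the quickstart sources, and the helper does nothing beyond the same "at least 3 parameters" check.
</P>

```cpp
#include <string>
#include <vector>

// Hypothetical holder for the three kinds of command line parameter.
struct IndexerArgs {
    std::string database_path;       // parameter 1
    std::string document_data;       // parameter 2
    std::vector<std::string> terms;  // parameters 3 onward
};

// Fills `out` and returns true when at least three parameters follow the
// program name (argv[0]); otherwise returns false and leaves `out` untouched.
bool parse_indexer_args(int argc, const char* const* argv, IndexerArgs& out) {
    if (argc < 4) return false;
    IndexerArgs parsed;
    parsed.database_path = argv[1];
    parsed.document_data = argv[2];
    for (int i = 3; i < argc; ++i) {
        parsed.terms.push_back(argv[i]);
    }
    out = parsed;
    return true;
}
```

<P>
A caller would print the usage message and exit when the helper returns false.
</P>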
| 148 | |
|---|
| 149 | <H3>Catching exceptions</H3> |
|---|
| 150 | <P> |
|---|
| 151 | When an error occurs in Xapian it is reported by means of the C++ exception |
|---|
| 152 | mechanism. All errors in Xapian are derived classes of |
|---|
| 153 | <CODE>Xapian::Error</CODE>, so simple error handling can be performed by |
|---|
| 154 | enclosing all the code in a try-catch block to catch any |
|---|
| 155 | <CODE>Xapian::Error</CODE> exceptions. A (hopefully) helpful message can be |
|---|
| 156 | extracted from the <CODE>Xapian::Error</CODE> object by calling its |
|---|
| 157 | <CODE>get_msg()</CODE> method, which returns a human readable string. |
|---|
| 158 | </P> |
|---|
| 159 | <P> |
|---|
| 160 | Note that all calls to the Xapian library should be performed inside a |
|---|
| 161 | try-catch block, since otherwise errors will result in uncaught exceptions; |
|---|
| 162 | this usually results in the execution aborting. |
|---|
| 163 | </P> |
|---|
| 164 | <P> |
|---|
| 165 | Note also that Xapian::Error is an abstract base class, and thus can't be caught by value: |
|---|
| 166 | you must therefore catch exceptions by reference, as in the following example |
|---|
| 167 | code: |
|---|
| 168 | </P> |
|---|
| 169 | <PRE> |
|---|
| 170 | try { |
|---|
| 171 | <B>[code which accesses Xapian]</B> |
|---|
| 172 | } catch (const Xapian::Error & error) { |
|---|
| 173 | cout << "Exception: " << error.get_msg() << endl; |
|---|
| 174 | } |
|---|
| 175 | </PRE> |
|---|
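<P>
To see why catching by reference matters, here is a minimal sketch which uses a stand-in class hierarchy rather than Xapian itself. <CODE>DemoError</CODE>, <CODE>DemoOpeningError</CODE> and <CODE>run_and_report()</CODE> are hypothetical names invented for this illustration; only the catch-by-reference pattern is taken from the text above.
</P>

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins for an error hierarchy; these are NOT Xapian classes.
class DemoError {
  public:
    explicit DemoError(const std::string& msg) : msg_(msg) {}
    virtual ~DemoError() {}
    // Pure virtual, so DemoError itself cannot be instantiated or caught by value.
    virtual std::string get_type() const = 0;
    std::string get_msg() const { return msg_; }
  private:
    std::string msg_;
};

class DemoOpeningError : public DemoError {
  public:
    explicit DemoOpeningError(const std::string& msg) : DemoError(msg) {}
    std::string get_type() const override { return "DemoOpeningError"; }
};

// Runs `action` and converts any DemoError into a printable string.
std::string run_and_report(const std::function<void()>& action) {
    try {
        action();
    } catch (const DemoError& error) {
        // The reference still refers to the derived object, so the
        // virtual call reports the concrete type.
        return error.get_type() + ": " + error.get_msg();
    }
    return "ok";
}
```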
| 176 | |
|---|
| 177 | <H3>Opening the database</H3> |
|---|
| 178 | |
|---|
| 179 | <P> |
|---|
| 180 | In Xapian, a database is opened for writing by creating a |
|---|
| 181 | Xapian::WritableDatabase object. |
|---|
| 182 | </P> |
|---|
| 183 | <P> |
|---|
| 184 | If you pass Xapian::DB_CREATE_OR_OPEN and there isn't an existing database |
|---|
| 185 | in the specified directory, Xapian will try to create a new empty database |
|---|
| 186 | there. If there is already a database in the specified directory, it will be |
|---|
| 187 | opened. |
|---|
| 188 | </P> |
|---|
| 189 | <P> |
|---|
| 190 | If an error occurs when trying to open a database, or to create a new database, |
|---|
| 191 | an exception, usually of type <CODE>Xapian::DatabaseOpeningError</CODE> or |
|---|
| 192 | <CODE>Xapian::DatabaseCreateError</CODE>, will be thrown. |
|---|
| 193 | </P> |
|---|
| 194 | <P> |
|---|
| 195 | The code to open a database for writing is, then: |
|---|
| 196 | </P> |
|---|
| 197 | |
|---|
| 198 | <PRE> |
|---|
| 199 | Xapian::WritableDatabase database(argv[1], Xapian::DB_CREATE_OR_OPEN); |
|---|
| 200 | </PRE> |
|---|
| 201 | |
|---|
| 202 | <H3>Preparing the new document</H3> |
|---|
| 203 | |
|---|
| 204 | <P> |
|---|
| 205 | Now that we have the database open, we need to prepare a document to |
|---|
| 206 | put in it. This is done by creating a Xapian::Document object, filling |
|---|
| 207 | this with data, and then giving it to the database. |
|---|
| 208 | </P> |
|---|
| 209 | |
|---|
| 210 | <P> |
|---|
| 211 | The first step, then, is to create the document: |
|---|
| 212 | </P> |
|---|
| 213 | <PRE> |
|---|
| 214 | Xapian::Document newdocument; |
|---|
| 215 | </PRE> |
|---|
| 216 | |
|---|
| 217 | <P> |
|---|
| 218 | Each <code>Xapian::Document</code> has a "cargo" known as the <i>document data</i>. |
|---|
| 219 | This data is opaque to Xapian - the meaning of it is entirely user-defined. |
|---|
| 220 | Typically it contains information to allow results to be displayed by the |
|---|
| 221 | application, for example a URL for the indexed document and |
|---|
| 222 | some text which is to be displayed when returning the document as a search |
|---|
| 223 | result. |
|---|
| 224 | </P> |
|---|
| 225 | <P> |
|---|
| 226 | For our example, we shall simply store the second parameter given on the |
|---|
| 227 | command line in the data field: |
|---|
| 228 | </P> |
|---|
| 229 | <PRE> |
|---|
| 230 | newdocument.set_data(string(argv[2])); |
|---|
| 231 | </PRE> |
|---|
| 232 | |
|---|
| 233 | <P> |
|---|
| 234 | The next step is to put the terms which are to be used when searching |
|---|
| 235 | for the document into the Xapian::Document object. |
|---|
| 236 | </P> |
|---|
| 237 | <P> |
|---|
| 238 | We shall use the <CODE>add_posting()</CODE> method, which adds an |
|---|
| 239 | occurrence of a term to the document. The first parameter is the |
|---|
| 240 | "<EM>termname</EM>", which is a string defining the term. This |
|---|
| 241 | string can be anything, as long as the same string is always used to refer |
|---|
| 242 | to the same term. The string will often be the (possibly stemmed) text |
|---|
| 243 | of the term, but might be in a compressed, or even hashed, form. |
|---|
| 244 | In general, there is no upper limit to the length of a termname, but some |
|---|
| 245 | database methods may impose their own limits. |
|---|
| 246 | </P> |
|---|
| 247 | <P> |
|---|
| 248 | The second parameter is the position at which the term occurs within the |
|---|
| 249 | document. These positions start at 1. This information is used for |
|---|
| 250 | some search features such as phrase matching or passage retrieval, but |
|---|
| 251 | is not essential to the search. |
|---|
| 252 | </P> |
|---|
| 253 | |
|---|
| 254 | <P> |
|---|
| 255 | We add postings for terms with the termname given as each of the remaining |
|---|
| 256 | command line parameters: |
|---|
| 257 | </P> |
|---|
| 258 | <PRE> |
|---|
| 259 | for (int i = 3; i < argc; ++i) { |
|---|
| 260 | newdocument.add_posting(argv[i], i - 2); |
|---|
| 261 | } |
|---|
| 262 | </PRE> |
|---|
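<P>
As a rough picture of what the loop above records, the following sketch builds a toy table of term positions and uses it for an adjacency check of the kind phrase matching depends on. It is a simplified model under our own assumptions: <CODE>Postings</CODE>, <CODE>make_postings()</CODE> and <CODE>adjacent()</CODE> are hypothetical names, and this is not how Xapian stores postings internally.
</P>

```cpp
#include <map>
#include <string>
#include <vector>

// Toy posting store: term -> positions at which it occurs in one document.
typedef std::map<std::string, std::vector<unsigned> > Postings;

// Assigns successive positions, starting at 1, to the terms in order.
Postings make_postings(const std::vector<std::string>& terms) {
    Postings postings;
    for (std::vector<std::string>::size_type i = 0; i < terms.size(); ++i) {
        postings[terms[i]].push_back(static_cast<unsigned>(i + 1));
    }
    return postings;
}

// True when `second` occurs at the position immediately after `first`.
bool adjacent(const Postings& postings,
              const std::string& first, const std::string& second) {
    Postings::const_iterator a = postings.find(first);
    Postings::const_iterator b = postings.find(second);
    if (a == postings.end() || b == postings.end()) return false;
    for (unsigned pa : a->second) {
        for (unsigned pb : b->second) {
            if (pb == pa + 1) return true;
        }
    }
    return false;
}
```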
| 263 | |
|---|
| 264 | <H3>Adding the document to the database</H3> |
|---|
| 265 | |
|---|
| 266 | <P> |
|---|
| 267 | Finally, we can add the document to the database. This simply involves |
|---|
| 268 | calling <CODE>Xapian::WritableDatabase::add_document()</CODE>, and passing it |
|---|
| 269 | the <CODE>Xapian::Document</CODE> object: |
|---|
| 270 | </P> |
|---|
| 271 | <PRE> |
|---|
| 272 | database.add_document(newdocument); |
|---|
| 273 | </PRE> |
|---|
| 274 | |
|---|
| 275 | <P> |
|---|
| 276 | The operation of adding a document is atomic: either the document will be |
|---|
| 277 | added, or an exception will be thrown and the document will not be in the |
|---|
| 278 | new database. |
|---|
| 279 | </P> |
|---|
| 280 | <P> |
|---|
| 281 | <CODE>add_document()</CODE> returns a value of type <CODE>Xapian::docid</CODE>. |
|---|
| 282 | This is the document ID of the newly added document, which is simply a |
|---|
| 283 | handle which can be used to access the document in future. |
|---|
| 284 | </P> |
|---|
| 285 | <P> |
|---|
| 286 | Note that this use of <CODE>add_document()</CODE> is actually fairly |
|---|
| 287 | inefficient: if we had a large database, it would be desirable to group |
|---|
| 288 | as many document additions together as possible, by encapsulating |
|---|
| 289 | them within a session. For details of this, and of the transaction |
|---|
| 290 | facility for performing sets of database modifications atomically, see |
|---|
| 291 | the <A HREF="overview.html">API Overview</A>. |
|---|
| 292 | </P> |
|---|
| 293 | |
|---|
| 294 | <HR> |
|---|
| 295 | <H2><A NAME="searcher">An example searcher</A></H2> |
|---|
| 296 | |
|---|
| 297 | <P> |
|---|
| 298 | Now we show the code for a simple searcher, which will search the |
|---|
| 299 | database built by the indexer above. Again, you can read <A |
|---|
| 300 | HREF="quickstartsearch.cc.html">an HTML formatted version</A>. |
|---|
| 301 | </P> |
|---|
| 302 | <P> |
|---|
| 303 | The "searcher" presented here is, like the "indexer", |
|---|
| 304 | simply a small command line driven program. It takes a path to a database |
|---|
| 305 | and some search terms, performs a probabilistic search for documents |
|---|
| 306 | represented by those terms and displays a ranked list of matching documents. |
|---|
| 307 | </P> |
|---|
| 308 | |
|---|
| 309 | <H3>Setting up</H3> |
|---|
| 310 | |
|---|
| 311 | <P> |
|---|
| 312 | Just like "quickstartindex", we have a single-function example. |
|---|
| 313 | So we include the Xapian header file, and begin: |
|---|
| 314 | </P> |
|---|
| 315 | <PRE> |
|---|
| 316 | #include &lt;xapian.h&gt; |
|---|
| 317 | |
|---|
| 318 | int main(int argc, char **argv) |
|---|
| 319 | { |
|---|
| 320 | </PRE> |
|---|
| 321 | |
|---|
| 322 | <H3>Options parsing</H3> |
|---|
| 323 | <P> |
|---|
| 324 | Again, we are going to use no special options, and have a very simple |
|---|
| 325 | command line syntax: |
|---|
| 326 | <UL><LI> |
|---|
| 327 | <B>Parameter 1</B> - the (possibly relative) path to the database. |
|---|
| 328 | </LI><LI> |
|---|
| 329 | <B>Parameters 2 onward</B> - the terms to be searched for in the database. |
|---|
| 330 | </LI></UL> |
|---|
| 331 | </P><P> |
|---|
| 332 | The validity of a command line can therefore be checked very simply by |
|---|
| 333 | ensuring that there are at least 2 parameters: |
|---|
| 334 | </P> |
|---|
| 335 | <PRE> |
|---|
| 336 | if (argc &lt; 3) { |
|---|
| 337 | cout &lt;&lt; "usage: " &lt;&lt; argv[0] &lt;&lt; |
|---|
| 338 | " &lt;path to database&gt; &lt;search terms&gt;" &lt;&lt; endl; |
|---|
| 339 | exit(1); |
|---|
| 340 | } |
|---|
| 341 | </PRE> |
|---|
| 342 | |
|---|
| 343 | |
|---|
| 344 | <H3>Catching exceptions</H3> |
|---|
| 345 | <P> |
|---|
| 346 | Again, this is performed just as it was for the simple indexer. |
|---|
| 347 | </P> |
|---|
| 348 | <PRE> |
|---|
| 349 | try { |
|---|
| 350 | <B>[code which accesses Xapian]</B> |
|---|
| 351 | } catch (const Xapian::Error & error) { |
|---|
| 352 | cout << "Exception: " << error.get_msg() << endl; |
|---|
| 353 | } |
|---|
| 354 | </PRE> |
|---|
| 355 | |
|---|
| 356 | <H3>Specifying the databases</H3> |
|---|
| 357 | <P> |
|---|
| 358 | Xapian has the ability to search over many databases simultaneously, |
|---|
| 359 | possibly even with the databases distributed across a network of machines. |
|---|
| 360 | Each database can be in its own format, so, for example, we might have a |
|---|
| 361 | system searching across two remote databases and a flint database. |
|---|
| 362 | </P> |
|---|
| 363 | <P> |
|---|
| 364 | To open a single database, we create a Xapian::Database object, passing |
|---|
| 365 | the path to the database we want to open: |
|---|
| 366 | </P> |
|---|
| 367 | <PRE> |
|---|
| 368 | Xapian::Database db(argv[1]); |
|---|
| 369 | </PRE> |
|---|
| 370 | <P> |
|---|
| 371 | You can also search multiple databases by adding them together using |
|---|
| 372 | <CODE>Xapian::Database::add_database</CODE>: |
|---|
| 373 | </P> |
|---|
| 374 | <PRE> |
|---|
| 375 | Xapian::Database databases; |
|---|
| 376 | databases.add_database(Xapian::Database(argv[1])); |
|---|
| 377 | databases.add_database(Xapian::Database(argv[2])); |
|---|
| 378 | </PRE> |
|---|
| 379 | |
|---|
| 380 | <H3>Starting an enquire session</H3> |
|---|
| 381 | <P> |
|---|
| 382 | All searches across databases by Xapian are performed within the context of |
|---|
| 383 | an "<EM>Enquire</EM>" session. This session is represented by a |
|---|
| 384 | <CODE>Xapian::Enquire</CODE> object, and operates across a specified collection of |
|---|
| 385 | databases. To change the database collection, it is necessary to open a |
|---|
| 386 | new enquire session, by creating a new <CODE>Xapian::Enquire</CODE> object. |
|---|
| 387 | <PRE> |
|---|
| 388 | Xapian::Enquire enquire(db); |
|---|
| 389 | </PRE> |
|---|
| 390 | </P> |
|---|
| 391 | <P> |
|---|
| 392 | An enquire session is also the context within which all other database |
|---|
| 393 | reading operations, such as query expansion and reading the data associated |
|---|
| 394 | with a document, are performed. |
|---|
| 395 | </P> |
|---|
| 396 | |
|---|
| 397 | <H3>Preparing to search</H3> |
|---|
| 398 | |
|---|
| 399 | <P> |
|---|
| 400 | We are going to use all command line parameters from the second onward |
|---|
| 401 | as terms to search for in the database. For convenience, we shall store |
|---|
| 402 | them in an STL vector. This is probably the point at which we would want |
|---|
| 403 | to apply a stemming algorithm, or any other desired normalisation and |
|---|
| 404 | conversion operation, to the terms. |
|---|
| 405 | <PRE> |
|---|
| 406 | vector&lt;string&gt; queryterms; |
|---|
| 407 | for (int optpos = 2; optpos &lt; argc; optpos++) { |
|---|
| 408 | queryterms.push_back(argv[optpos]); |
|---|
| 409 | } |
|---|
| 410 | </PRE> |
|---|
| 411 | </P> |
|---|
| 412 | |
|---|
| 413 | <P> |
|---|
| 414 | Queries are represented within Xapian by <CODE>Xapian::Query</CODE> objects, so |
|---|
| 415 | the next step is to construct one from our query terms. |
|---|
| 416 | Conveniently there is a constructor which will take our vector |
|---|
| 417 | of terms and create a <CODE>Xapian::Query</CODE> object from it. |
|---|
| 418 | <PRE> |
|---|
| 419 | Xapian::Query query(Xapian::Query::OP_OR, queryterms.begin(), queryterms.end()); |
|---|
| 420 | </PRE> |
|---|
| 421 | </P> |
|---|
| 422 | |
|---|
| 423 | <P> |
|---|
| 424 | You will notice that we had to specify an operation to be performed on |
|---|
| 425 | the terms (the <CODE>Xapian::Query::OP_OR</CODE> parameter). |
|---|
| 426 | Queries in Xapian are actually |
|---|
| 427 | fairly complex things: a full range of boolean operations can be applied to |
|---|
| 428 | queries to restrict the result set, and probabilistic weightings are then |
|---|
| 429 | applied to order the results by relevance. By specifying the OR operation, |
|---|
| 430 | we are not performing any boolean restriction, and are performing a |
|---|
| 431 | traditional pure probabilistic search. |
|---|
| 432 | </P> |
|---|
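<P>
The effect of combining terms with OR can be sketched without Xapian at all: every document containing at least one query term matches, and the matches are then ordered by a score. In this hedged illustration the score is simply the number of query terms matched, which is a crude stand-in and not Xapian's probabilistic weighting; <CODE>TermSet</CODE> and <CODE>or_search()</CODE> are hypothetical names.
</P>

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A toy document: just the set of terms it was indexed with.
typedef std::set<std::string> TermSet;
typedef std::pair<std::size_t, int> Match;  // (document index, terms matched)

// Returns every document matching at least one query term, best match
// first. The match count stands in for a real relevance weight.
std::vector<Match> or_search(const std::vector<TermSet>& docs,
                             const std::vector<std::string>& query) {
    std::vector<Match> results;
    for (std::size_t d = 0; d < docs.size(); ++d) {
        int matched = 0;
        for (const std::string& term : query) {
            if (docs[d].count(term)) ++matched;
        }
        // OR imposes no boolean restriction: one matching term is enough.
        if (matched > 0) results.push_back(std::make_pair(d, matched));
    }
    // Stable sort keeps index order among documents with equal counts.
    std::stable_sort(results.begin(), results.end(),
                     [](const Match& a, const Match& b) {
                         return a.second > b.second;
                     });
    return results;
}
```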
| 433 | |
|---|
| 434 | <P> |
|---|
| 435 | We now print a message out to confirm to the user what the query being |
|---|
| 436 | performed is. This is done with the <CODE>Xapian::Query::get_description()</CODE> |
|---|
| 437 | method, which is mainly included for debugging purposes, and displays |
|---|
| 438 | a string representation of the query. |
|---|
| 439 | </P> |
|---|
| 440 | <PRE> |
|---|
| 441 | cout << "Performing query `" << |
|---|
| 442 | query.get_description() << "'" << endl; |
|---|
| 443 | </PRE> |
|---|
| 444 | |
|---|
| 445 | <H3>Performing the search</H3> |
|---|
| 446 | <P> |
|---|
| 447 | Now, we are ready to perform the search. The first step of this is to |
|---|
| 448 | give the query object to the enquire session. Note that the query is |
|---|
| 449 | copied at this operation, and that changing the Xapian::Query object after |
|---|
| 450 | setting the query with it has no effect. |
|---|
| 451 | </P> |
|---|
| 452 | <PRE> |
|---|
| 453 | enquire.set_query(query); |
|---|
| 454 | </PRE> |
|---|
| 455 | |
|---|
| 456 | <P> |
|---|
| 457 | Next, we ask for the results of the search. There is no need to tell |
|---|
| 458 | Xapian to perform the search: it will do this automatically. We use |
|---|
| 459 | the <CODE>get_mset()</CODE> method to get the results, which are returned |
|---|
| 460 | in a <CODE>Xapian::MSet</CODE> object (MSet stands for Match Set). |
|---|
| 461 | </P> |
|---|
| 462 | <P> |
|---|
| 463 | <CODE>get_mset()</CODE> can take many parameters, such as a set of |
|---|
| 464 | relevant documents to use, and various options to modify the search, |
|---|
| 465 | but we give it the minimum, which is the first document to return (starting |
|---|
| 466 | at 0 for the top ranked document), and the maximum number of documents |
|---|
| 467 | to return (we specify 10 here): |
|---|
| 468 | <PRE> |
|---|
| 469 | Xapian::MSet matches = enquire.get_mset(0, 10); |
|---|
| 470 | </PRE> |
|---|
| 471 | </P> |
|---|
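<P>
The two arguments given to <CODE>get_mset()</CODE> above behave like a window onto the ranked list of matches. The sketch below models that window on a plain vector; <CODE>page_of()</CODE> is a hypothetical helper written for this illustration and is not part of the Xapian API.
</P>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns at most `maxitems` entries of `ranked`, starting at index `first`
// (0 is the top ranked entry). An out-of-range `first` gives an empty result.
template <typename T>
std::vector<T> page_of(const std::vector<T>& ranked,
                       std::size_t first, std::size_t maxitems) {
    if (first >= ranked.size()) return std::vector<T>();
    std::size_t last = first + std::min(maxitems, ranked.size() - first);
    return std::vector<T>(ranked.begin() + first, ranked.begin() + last);
}
```

<P>
With this model, asking for (0, 10) gives the first page of ten results and (10, 10) would give the second.
</P>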
| 472 | |
|---|
| 473 | <H3>Displaying the results of the search</H3> |
|---|
| 474 | <P> |
|---|
| 475 | Finally, we display the results of the search. The results are stored in |
|---|
| 476 | the <CODE>Xapian::MSet</CODE> object, which provides the features required |
|---|
| 477 | to be an STL-compatible container, so first we display how many items are in |
|---|
| 478 | the MSet: |
|---|
| 479 | <PRE> |
|---|
| 480 | cout << matches.size() << " results found" << endl; |
|---|
| 481 | </PRE> |
|---|
| 482 | </P> |
|---|
| 483 | |
|---|
| 484 | <P> |
|---|
| 485 | Now we display some information about each of the items in the |
|---|
| 486 | <CODE>Xapian::MSet</CODE>. We access these items using a |
|---|
| 487 | <CODE>Xapian::MSetIterator</CODE>: |
|---|
| 488 | <UL><LI> |
|---|
| 489 | First, we display the document ID, accessed by <CODE>*i</CODE>. |
|---|
| 490 | This is not usually very useful information to give to users, but it is |
|---|
| 491 | at least a unique handle on each document. |
|---|
| 492 | </LI><LI> |
|---|
| 493 | Next, we display a "percentage" score for the document. Readers |
|---|
| 494 | familiar with Information Retrieval will not be surprised to hear that this |
|---|
| 495 | is not really a percentage: it is just a value from 0 to 100, such that a |
|---|
| 496 | more relevant document has a higher value. We get this using |
|---|
| 497 | <CODE>i.get_percent()</CODE>. |
|---|
| 498 | </LI><LI> |
|---|
| 499 | Last, we display the data associated with each returned document, which |
|---|
| 500 | was specified by the user at database generation time. To do this, we |
|---|
| 501 | first use <CODE>i.get_document()</CODE> to get a <CODE>Xapian::Document</CODE> |
|---|
| 502 | object representing the returned document; then we use the |
|---|
| 503 | <CODE>get_data()</CODE> method of this object to get |
|---|
| 504 | access to the data stored in this document. |
|---|
| 505 | </LI></UL> |
|---|
| 506 | <PRE> |
|---|
| 507 | Xapian::MSetIterator i; |
|---|
| 508 | for (i = matches.begin(); i != matches.end(); ++i) { |
|---|
| 509 | cout << "Document ID " << *i << "\t"; |
|---|
| 510 | cout << i.get_percent() << "% "; |
|---|
| 511 | Xapian::Document doc = i.get_document(); |
|---|
| 512 | cout << "[" << doc.get_data() << "]" << endl; |
|---|
| 513 | } |
|---|
| 514 | </PRE> |
|---|
| 515 | </P> |
|---|
| 516 | |
|---|
| 517 | <HR> |
|---|
| 518 | <H2>Compiling</H2> |
|---|
| 519 | |
|---|
| 520 | Now that we have the code written, all we need to do is compile it! |
|---|
| 521 | |
|---|
| 522 | <H3>Finding the Xapian library</H3> |
|---|
| 523 | |
|---|
| 524 | <P> |
|---|
| 525 | A small utility, "xapian-config", is installed along with Xapian |
|---|
| 526 | to assist you in finding the installed Xapian library, and in generating |
|---|
| 527 | the flags to pass to the compiler and linker to compile. |
|---|
| 528 | </P><P> |
|---|
| 529 | After a successful installation, this utility should be in your path, so |
|---|
| 530 | you can simply run |
|---|
| 531 | <BLOCKQUOTE><CODE>xapian-config --cxxflags</CODE></BLOCKQUOTE> |
|---|
| 532 | to determine the flags to pass to the compiler, and |
|---|
| 533 | <BLOCKQUOTE><CODE>xapian-config --libs</CODE></BLOCKQUOTE> |
|---|
| 534 | to determine the flags to pass to the linker. |
|---|
| 535 | |
|---|
| 536 | These flags are returned on the utility's standard output (so you could use |
|---|
| 537 | backtick notation to include them on your command line). |
|---|
| 538 | </P><P> |
|---|
| 539 | If your project uses the GNU autoconf tool, you may also use the |
|---|
| 540 | <CODE>XO_LIB_XAPIAN</CODE> macro, which is included as part of Xapian, |
|---|
| 541 | and will check for an installation of Xapian and set (and |
|---|
| 542 | <CODE>AC_SUBST</CODE>) the <CODE>XAPIAN_CXXFLAGS</CODE> and |
|---|
| 543 | <CODE>XAPIAN_LIBS</CODE> variables to |
|---|
| 544 | be the flags to pass to the compiler and linker, respectively. |
|---|
| 545 | </P><P> |
|---|
| 546 | If you don't use GNU autoconf, don't worry about this. |
|---|
| 547 | </P> |
|---|
| 548 | |
|---|
| 549 | <H3>Compiling the quickstart examples</H3> |
|---|
| 550 | Once you know the compilation flags, compilation is a simple matter of |
|---|
| 551 | invoking the compiler! For our example, we could compile the two |
|---|
| 552 | utilities (quickstartindex and quickstartsearch) with the commands: |
|---|
| 553 | <PRE> |
|---|
| 554 | c++ quickstartindex.cc `xapian-config --libs --cxxflags` -o quickstartindex |
|---|
| 555 | c++ quickstartsearch.cc `xapian-config --libs --cxxflags` -o quickstartsearch |
|---|
| 556 | </PRE> |
|---|
| 557 | |
|---|
| 558 | <HR> |
|---|
| 559 | <H2>Running the examples</H2> |
|---|
| 560 | |
|---|
| 561 | <P> |
|---|
| 562 | Once we have compiled the above examples, we can build up a simple |
|---|
| 563 | database as follows. Note that we must first create a directory for |
|---|
| 564 | the database files to live in; although Xapian will create new empty |
|---|
| 565 | database files if they do not yet exist, it will not create a new |
|---|
| 566 | directory for them. |
|---|
| 567 | <PRE> |
|---|
| 568 | $ mkdir proverbs |
|---|
| 569 | $ ./quickstartindex proverbs \ |
|---|
| 570 | > "people who live in glass houses should not throw stones" \ |
|---|
| 571 | > people live glass house stone |
|---|
| 572 | $ ./quickstartindex proverbs \ |
|---|
| 573 | > "Don't look a gift horse in the mouth" \ |
|---|
| 574 | > look gift horse mouth |
|---|
| 575 | </PRE> |
|---|
| 576 | </P> |
|---|
| 577 | |
|---|
| 578 | <P> |
|---|
| 579 | Now, we should have a database with a couple of documents in it. Looking |
|---|
| 580 | in the database directory, you should see something like: |
|---|
| 581 | <PRE> |
|---|
| 582 | $ ls proverbs/ |
|---|
| 583 | <i>[some files]</i> |
|---|
| 584 | </PRE> |
|---|
| 585 | </P> |
|---|
| 586 | <P> |
|---|
| 587 | Given the small amount of data in the database, you may be concerned that |
|---|
| 588 | the total size of these files is somewhat over 50k. Be reassured that the |
|---|
| 589 | database is block structured, here consisting of largely empty |
|---|
| 590 | blocks, and will behave much better for large databases. |
|---|
| 591 | </P> |
|---|
| 592 | |
|---|
| 593 | <P> |
|---|
| 594 | We can now perform searches over the database using the quickstartsearch |
|---|
| 595 | program. |
|---|
| 596 | <PRE> |
|---|
| 597 | $ ./quickstartsearch proverbs look |
|---|
| 598 | Performing query `look' |
|---|
| 599 | 1 results found |
|---|
| 600 | Document ID 2 50% [Don't look a gift horse in the mouth] |
|---|
| 601 | </PRE> |
|---|
| 602 | </P> |
|---|
| 603 | |
|---|
| 604 | <!-- FOOTER $Author$ $Date$ $Id$ --> |
|---|
| 605 | </BODY> |
|---|
| 606 | </HTML> |
|---|