| 1 | .. This document was originally written by Richard Boulton, with funding |
|---|
| 2 | .. provided by Enfold Systems. |
|---|
| 3 | |
|---|
| 4 | .. Copyright (C) 2006 Lemur Consulting Ltd |
|---|
| 5 | .. Copyright (C) 2007 Olly Betts |
|---|
| 6 | |
|---|
| 7 | ============================ |
|---|
| 8 | Xapian Administrator's Guide |
|---|
| 9 | ============================ |
|---|
| 10 | |
|---|
| 11 | .. contents:: Table of contents |
|---|
| 12 | |
|---|
| 13 | Introduction |
|---|
| 14 | ============ |
|---|
| 15 | |
|---|
| 16 | This document is intended to provide general hints, tips and advice to |
|---|
| 17 | administrators of Xapian systems. It assumes that you have installed Xapian |
|---|
| 18 | on your system, and are familiar with the basics of creating and searching |
|---|
| 19 | Xapian databases. |
|---|
| 20 | |
|---|
| 21 | The intended audience is system administrators who need to be able to perform |
|---|
| 22 | general management of a Xapian database, including tasks such as taking |
|---|
| 23 | backups and optimising performance. It may also be useful introductory |
|---|
| 24 | reading for Xapian application developers. |
|---|
| 25 | |
|---|
| 26 | The document is up-to-date for Xapian version 1.0.5. |
|---|
| 27 | |
|---|
| 28 | Databases |
|---|
| 29 | ========= |
|---|
| 30 | |
|---|
| 31 | Xapian databases hold all the information needed to perform searches in a set |
|---|
| 32 | of tables. The following tables always exist: |
|---|
| 33 | |
|---|
| 34 | - A posting list table, which holds a list of all the documents indexed by |
|---|
| 35 | each term in the database. |
|---|
| 36 | - A record table, which holds the document data associated with each document |
|---|
| 37 | in the database. |
|---|
| 38 | - A termlist table, which holds a list of all the terms which index each |
|---|
| 39 | document. |
|---|
| 40 | |
|---|
| 41 | And the following optional tables exist only when there is data to store in |
|---|
| 42 | them (in 1.0.1 and earlier, the position and value tables were always created |
|---|
| 43 | even if empty; spelling and synonym tables are new in 1.0.2): |
|---|
| 44 | |
|---|
| 45 | - A position list table, which holds, for each term, a list of the word |
|---|
| 46 | positions in each document at which that term occurs. |
|---|
| 47 | - A value table, which holds the "values" (used for sorting, collapsing, and |
|---|
| 48 | other match-time calculations) associated with each document in the |
|---|
| 49 | database. |
|---|
| 50 | - A spelling table, which holds data for suggesting spelling corrections. |
|---|
| 51 | - A synonym table, which holds a synonym dictionary. |
|---|
| 52 | |
|---|
| 53 | Each of the tables is held in a separate file, allowing an administrator to |
|---|
| 54 | see how much data is being used for each of the above purposes. It is not |
|---|
| 55 | always necessary to fully populate these tables: for example, if phrase |
|---|
| 56 | searches are never going to be performed on the database, it is not necessary |
|---|
| 57 | to store any positionlist information. |
|---|
| 58 | |
|---|
| 59 | If you look at a Xapian database, you will see that each of these tables |
|---|
| 60 | actually uses 2 or 3 files. For example, for a "flint" format database the |
|---|
| 61 | termlist table is stored in the files "termlist.baseA", "termlist.baseB" |
|---|
| 62 | and "termlist.DB". |
|---|
| 63 | |
|---|
| 64 | Of these files, only the ".DB" file actually stores the data. The ".baseA" |
|---|
| 65 | and ".baseB" files are used to keep track of where to start looking for that |
|---|
| 66 | data in the ".DB" file. Often, only one of the ".baseA" and ".baseB" files |
|---|
| 67 | will be present; each of these files refers to a revision of the database, and |
|---|
| 68 | there may be more than one valid revision of the database stored in the ".DB" |
|---|
| 69 | file at once. |
|---|
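As a rough illustration of the point that each table's data lives in its own ".DB" file, the following shell sketch prints the size in bytes of each table in a flint database directory. The function name ``table_sizes`` is made up for this example; it is not a Xapian tool.

```shell
# Print the size in bytes of each table's ".DB" file in a database directory.
# Usage: table_sizes /path/to/database
table_sizes() {
    db_dir=$1
    for f in "$db_dir"/*.DB; do
        [ -f "$f" ] || continue              # skip if the glob matched nothing
        table=$(basename "$f" .DB)           # e.g. "termlist" from "termlist.DB"
        size=$(wc -c < "$f" | tr -d ' ')     # byte count, with any padding removed
        printf '%s\t%s\n' "$table" "$size"
    done
}
```

The ".baseA" and ".baseB" files are deliberately ignored, since they only track where the data starts and are small.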
| 70 | |
|---|
| 71 | The ".DB" file is structured as a set of blocks, which have a default size of |
|---|
| 72 | 8KB (though this can be set, either through the Xapian API, or with some of |
|---|
| 73 | the tools detailed later in this document). The first block is used for |
|---|
| 74 | header information, so a ".DB" file with only a single entry will be 16KB in |
|---|
| 75 | size. |
|---|
| 76 | |
|---|
| 77 | Changing the blocksize may have performance implications, but it is hard to |
|---|
| 78 | tell whether these will be positive or negative for a particular combination |
|---|
| 79 | of hardware and software without doing some profiling. |
|---|
| 80 | |
|---|
| 81 | Atomic modifications |
|---|
| 82 | -------------------- |
|---|
| 83 | |
|---|
| 84 | Xapian ensures that all modifications to its database are performed |
|---|
| 85 | atomically. This means that: |
|---|
| 86 | |
|---|
| 87 | - From the point of view of a separate process (or a separate database object |
|---|
| 88 | in the same process) reading the database, all modifications made to a |
|---|
| 89 | database are invisible until the modifications are committed. |
|---|
| 90 | - The database on disk is always in a consistent state. |
|---|
| 91 | - If the system is interrupted during a modification, the database should |
|---|
| 92 | always be left in a valid state. This applies even if the power is cut |
|---|
| 93 | unexpectedly, as long as the disk does not become corrupted due to hardware |
|---|
| 94 | failure. |
|---|
| 95 | |
|---|
| 96 | Committing a modification requires several calls to the operating system to |
|---|
| 97 | make it flush any cached modifications to the database to disk. This is to |
|---|
| 98 | ensure that if the system fails at any point, the database is left in a |
|---|
| 99 | consistent state. Of course, this is a fairly slow process (since the system |
|---|
| 100 | has to wait for the disk to physically write the data), so grouping many |
|---|
| 101 | changes together will speed up the throughput considerably. |
|---|
| 102 | |
|---|
| 103 | Many modifications can be explicitly grouped into a single transaction, so |
|---|
| 104 | that lots of changes are applied at once. Even if an application doesn't |
|---|
| 105 | explicitly protect modifications to the database using transactions, Xapian |
|---|
| 106 | will group modifications into transactions, applying the modifications in |
|---|
| 107 | batches. |
|---|
| 108 | |
|---|
| 109 | Note that it is not currently possible to extend Xapian's transactions to |
|---|
| 110 | cover multiple databases, or to link them with transactions in external |
|---|
| 111 | systems, such as an RDBMS. |
|---|
| 112 | |
|---|
| 113 | Finally, note that it is possible to compile Xapian such that it doesn't make |
|---|
| 114 | modifications in an atomic manner, in order to build very large databases more |
|---|
| 115 | quickly (search the Xapian mailing list archives for "DANGEROUS" mode for more |
|---|
| 116 | details). This isn't yet integrated into standard builds of Xapian, but may |
|---|
| 117 | be in future, if appropriate protections can be incorporated. |
|---|
| 118 | |
|---|
| 119 | Single writer, multiple reader |
|---|
| 120 | ------------------------------ |
|---|
| 121 | |
|---|
| 122 | Xapian implements a "single writer, multiple reader" model. This means that, |
|---|
| 123 | at any given instant, there is only permitted to be a single object modifying |
|---|
| 124 | a database, but many objects may be reading the database at the same |
|---|
| 125 | time. |
|---|
| 126 | |
|---|
| 127 | Xapian enforces this restriction using lock-files. For a flint database, each |
|---|
| 128 | Xapian database directory contains a lock file named ``flintlock``. The |
|---|
| 129 | lock-file will always exist, but will be locked using ``fcntl()`` when the |
|---|
| 130 | database is open for writing. If a writer exits without being given a |
|---|
| 131 | chance to clean up (for example, if the application holding the writer |
|---|
| 132 | is killed), the ``fcntl()`` lock will be automatically released by the operating |
|---|
| 133 | system. Under Microsoft Windows, we use a different locking technique, but |
|---|
| 134 | with the same features. |
|---|
| 135 | |
|---|
| 136 | Revision numbers |
|---|
| 137 | ---------------- |
|---|
| 138 | |
|---|
| 139 | Xapian databases contain a revision number. This is essentially a count of |
|---|
| 140 | the number of modifications since the database was created, and is needed to |
|---|
| 141 | implement the atomic modification functionality. It is stored as a 32 bit |
|---|
| 142 | integer, so there is a chance that a very frequently updated database could |
|---|
| 143 | cause this to overflow. The consequence of such an overflow would be to throw |
|---|
| 144 | database errors. |
|---|
| 145 | |
|---|
| 146 | This isn't likely to be a practical problem, since it would take nearly a year |
|---|
| 147 | for a database to reach this limit if 100 modifications were committed every |
|---|
| 148 | second, and no normal Xapian system will commit more than once every few |
|---|
| 149 | seconds. However, if you are concerned, you can use the ``xapian-compact`` |
|---|
| 150 | tool to make a fresh copy of the database with the revision number set to 1. |
|---|
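The order of magnitude of this estimate can be checked with a little shell arithmetic. The text does not say whether the counter is treated as signed or unsigned, so this sketch shows both cases; the rate of 100 commits per second is the figure used above. It assumes a shell with 64 bit arithmetic.

```shell
commits_per_second=100
seconds_per_day=86400

# Days until a 32 bit revision counter overflows at the given commit rate.
unsigned_days=$(( 4294967296 / commits_per_second / seconds_per_day ))   # 2^32 revisions
signed_days=$(( 2147483648 / commits_per_second / seconds_per_day ))     # 2^31 revisions

echo "unsigned 32 bit: about $unsigned_days days"
echo "signed 32 bit:   about $signed_days days"
```

Either way, the limit corresponds to hundreds of days of continuous commits at a rate far higher than a normal system would sustain.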
| 151 | |
|---|
| 152 | For a "flint" database, the revision number of each table can be displayed by |
|---|
| 153 | the ``xapian-check`` tool. |
|---|
| 154 | |
|---|
| 155 | Network file systems |
|---|
| 156 | -------------------- |
|---|
| 157 | |
|---|
| 158 | Xapian should work correctly over a network file system. However, there are a |
|---|
| 159 | large number of potential issues with such file systems, so we recommend |
|---|
| 160 | extensive testing of your particular network file system before deployment. |
|---|
| 161 | |
|---|
| 162 | Be warned that Xapian is heavily I/O dependent, and therefore performance over |
|---|
| 163 | a network file system is likely to be slow unless you've got a very well tuned |
|---|
| 164 | setup. |
|---|
| 165 | |
|---|
| 166 | Xapian needs to be able to create a lock file in a database directory when |
|---|
| 167 | modifications are being performed. On some network file systems (e.g., NFS) |
|---|
| 168 | this requires a lock daemon to be running. |
|---|
| 169 | |
|---|
| 170 | Which database format to use? |
|---|
| 171 | ----------------------------- |
|---|
| 172 | |
|---|
| 173 | As of release 1.0.0, you should use the flint format (which is now the |
|---|
| 174 | default). The quartz format is now deprecated and support is scheduled |
|---|
| 175 | for removal in 1.1.0. |
|---|
| 176 | |
|---|
| 177 | Can I put other files in the database directory? |
|---|
| 178 | ------------------------------------------------ |
|---|
| 179 | |
|---|
| 180 | If you wish to store meta-data or other information relating to the Xapian |
|---|
| 181 | database, it is reasonable to wish to put this in files inside the Xapian |
|---|
| 182 | database directory, for neatness. For example, you might wish to store a list |
|---|
| 183 | of the prefixes you've applied to terms for specific fields in the database. |
|---|
| 184 | |
|---|
| 185 | Xapian's "flint" backend doesn't perform any operations |
|---|
| 186 | which will break this technique, so as long as you don't choose a filename |
|---|
| 187 | that Xapian uses itself, there should be no problems. However, be aware that |
|---|
| 188 | new versions of Xapian may use new files in the database directory, and it is |
|---|
| 189 | also possible that new backend formats may not be compatible with the |
|---|
| 190 | technique (e.g., it is possible that a future backend could store its entire |
|---|
| 191 | database in a single file, not in a directory). |
|---|
| 192 | |
|---|
| 193 | |
|---|
| 194 | Backup Strategies |
|---|
| 195 | ================= |
|---|
| 196 | |
|---|
| 197 | Summary |
|---|
| 198 | ------- |
|---|
| 199 | |
|---|
| 200 | - The simplest way to perform a backup is to temporarily halt modifications, |
|---|
| 201 | take a copy of all files in the database directory, and then allow |
|---|
| 202 | modifications to resume. Read access can continue while a backup is being |
|---|
| 203 | taken. |
|---|
| 204 | |
|---|
| 205 | - If you have a filesystem which allows atomic snapshots to be taken of |
|---|
| 206 | directories (such as an LVM filesystem), an alternative strategy is to take |
|---|
| 207 | a snapshot and simply copy all the files in the database directory to the |
|---|
| 208 | backup medium. Such a copy will always be a valid database. |
|---|
| 209 | |
|---|
| 210 | - Progressive backups are not easily possible; modifications are typically |
|---|
| 211 | spread throughout the database files. |
|---|
| 212 | |
|---|
| 213 | Detail |
|---|
| 214 | ------ |
|---|
| 215 | |
|---|
| 216 | Even though Xapian databases are often automatically generated from source |
|---|
| 217 | data which is stored in a reliable manner, it is usually desirable to keep |
|---|
| 218 | backups of Xapian databases being run in production environments. This is |
|---|
| 219 | particularly important in systems with high-availability requirements, since |
|---|
| 220 | re-building a Xapian database from scratch can take many hours. It is also |
|---|
| 221 | important in the case where the data stored in the database cannot easily be |
|---|
| 222 | recovered from external sources. |
|---|
| 223 | |
|---|
| 224 | Xapian databases are managed such that at any instant in time, there is at |
|---|
| 225 | least one valid revision of the database written to disk (and if there are |
|---|
| 226 | multiple valid revisions, Xapian will always open the most recent). |
|---|
| 227 | Therefore, if it is possible to take an instantaneous snapshot of all the |
|---|
| 228 | database files (for example, on an LVM filesystem), this snapshot is suitable |
|---|
| 229 | for copying to a backup medium. Note that it is not sufficient to take a |
|---|
| 230 | snapshot of each database file in turn - the snapshot must be across all |
|---|
| 231 | database files simultaneously. Otherwise, there is a risk that the snapshot |
|---|
| 232 | could contain database files from different revisions. |
|---|
| 233 | |
|---|
| 234 | If it is not possible to take an instantaneous snapshot, the best backup |
|---|
| 235 | strategy is simply to ensure that no modifications are committed during the |
|---|
| 236 | backup procedure. While the simplest way to implement this may be to stop |
|---|
| 237 | whatever processes are used to modify the database, and ensure that they close |
|---|
| 238 | the database, it is not actually necessary to ensure that no writers are open |
|---|
| 239 | on the database; it is enough to ensure that no writer makes any modification |
|---|
| 240 | to the database. |
|---|
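The simple strategy described above might be scripted along these lines. This is only a sketch: ``backup_db`` is a hypothetical helper, and it assumes that whatever mechanism your application uses to pause indexing has already been invoked, since nothing here stops a writer from committing.

```shell
# Copy every file in a Xapian database directory to a backup directory.
# Assumes no modifications are committed while the copy is running.
# Usage: backup_db /path/to/database /path/to/backup
backup_db() {
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "no such database directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # "$src"/. copies the directory's contents, including hidden files,
    # and -p preserves timestamps and permissions.
    cp -pR "$src"/. "$dest"/
}
```

Searches can keep running against the original database while the copy is made; only writers need to be held off.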
| 241 | |
|---|
| 242 | Because a Xapian database can contain more than one valid revision of the |
|---|
| 243 | database, it is actually possible to allow a limited number of modifications |
|---|
| 244 | to be performed while a backup copy is being made, but this is tricky and we |
|---|
| 245 | do not recommend relying on it. Future versions of Xapian are likely to |
|---|
| 246 | support this better, by allowing the current revision of a database to be |
|---|
| 247 | preserved while modifications continue. |
|---|
| 248 | |
|---|
| 249 | Progressive backups are not recommended for Xapian databases: Xapian database |
|---|
| 250 | files are block-structured, and modifications are spread throughout the |
|---|
| 251 | database file. Therefore, a progressive backup tool will not be able to take |
|---|
| 252 | a backup by storing only the new parts of the database. Modifications will |
|---|
| 253 | normally be so extensive that most parts of the database have been modified. |
|---|
| 254 | However, if only a small number of modifications have been made, a binary diff |
|---|
| 255 | algorithm might make a usable progressive backup tool. |
|---|
| 256 | |
|---|
| 257 | |
|---|
| 258 | Inspecting a database |
|---|
| 259 | ===================== |
|---|
| 260 | |
|---|
| 261 | When designing an indexing strategy, it is often useful to be able to check |
|---|
| 262 | the contents of the database. Xapian includes a simple command-line program, |
|---|
| 263 | "delve", to allow this. |
|---|
| 264 | |
|---|
| 265 | For example, to display the list of terms in document "1" of the database |
|---|
| 266 | "foo", use:: |
|---|
| 267 | |
|---|
| 268 | delve foo -r 1 |
|---|
| 269 | |
|---|
| 270 | It is also possible to perform simple searches of a database. Xapian includes |
|---|
| 271 | another simple command-line program, "quest", to support this. "quest" is |
|---|
| 272 | only able to search for un-prefixed terms, and the query string must be quoted to |
|---|
| 273 | protect it from the shell. To search the database "foo" for the phrase "hello |
|---|
| 274 | world", use:: |
|---|
| 275 | |
|---|
| 276 | quest -d foo '"hello world"' |
|---|
| 277 | |
|---|
| 278 | If you have installed the "Omega" CGI application built on Xapian, this can |
|---|
| 279 | also be used with the built-in "godmode" template to provide a web-based |
|---|
| 280 | interface for browsing a database. See Omega's documentation for more details |
|---|
| 281 | on this. |
|---|
| 282 | |
|---|
| 283 | Database maintenance |
|---|
| 284 | ==================== |
|---|
| 285 | |
|---|
| 286 | Compacting a database |
|---|
| 287 | --------------------- |
|---|
| 288 | |
|---|
| 289 | Xapian databases normally have some spare space in each block to allow |
|---|
| 290 | new information to be efficiently slotted into the database. However, the |
|---|
| 291 | smaller a database is, the faster it can be searched, so if there aren't |
|---|
| 292 | expected to be many further modifications, it can be desirable to compact the |
|---|
| 293 | database. |
|---|
| 294 | |
|---|
| 295 | Xapian includes a tool, "xapian-compact", for compacting "flint" format |
|---|
| 296 | databases. |
|---|
| 297 | This tool makes a copy of a database, and takes advantage of the sorted nature |
|---|
| 298 | of the source Xapian database to write the database out without leaving so |
|---|
| 299 | much space for future modifications. This can result in a large space saving. |
|---|
| 300 | |
|---|
| 301 | The downside of compaction is that future modifications may take a little |
|---|
| 302 | longer, due to needing to reorganise the database to make space for them. |
|---|
| 303 | However, modifications are still possible, and if many modifications are made, |
|---|
| 304 | the database will eventually adjust itself. |
|---|
| 305 | |
|---|
| 306 | The tool has an option ("-F") to perform a "fuller" compaction. This option |
|---|
| 307 | compacts the database as much as possible, but it violates the design of the |
|---|
| 308 | Btree format slightly to achieve this, so it is not recommended if further |
|---|
| 309 | modifications are at all likely in future. If you do need to modify a "fuller" |
|---|
| 310 | compacted database, we recommend you run xapian-compact on it without "-F" |
|---|
| 311 | first. |
|---|
| 312 | |
|---|
| 313 | While taking a copy of the database, it is also possible to change the |
|---|
| 314 | blocksize. If you wish to profile search speed with different blocksizes, |
|---|
| 315 | this is the recommended way to generate the different databases (but remember |
|---|
| 316 | to compact the original database as well, for a fair comparison). |
|---|
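For instance, a profiling run might prepare several copies of the same database along the following lines. This is a sketch rather than a recipe: ``compact_with_blocksizes`` is a hypothetical helper, and the ``-b`` option for setting the blocksize is an assumption that should be checked against ``xapian-compact --help`` for your version.

```shell
# Make one compacted copy of a database per blocksize, for profiling.
# Usage: compact_with_blocksizes SOURCE DEST_PREFIX BLOCKSIZE...
compact_with_blocksizes() {
    src=$1
    prefix=$2
    shift 2
    for bs in "$@"; do
        # Assumed option: -b sets the blocksize of the copy, in bytes.
        xapian-compact -b "$bs" "$src" "$prefix-$bs" || return 1
    done
}

# Example (not run here): copies at 4KB, 8KB and 16KB blocksizes.
# compact_with_blocksizes foo foo-compact 4096 8192 16384
```

Because every copy is produced by the same compaction step, differences in search speed between the copies should mostly reflect the blocksize rather than how full the blocks are.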
| 317 | |
|---|
| 318 | |
|---|
| 319 | Merging databases |
|---|
| 320 | ----------------- |
|---|
| 321 | |
|---|
| 322 | When building an index for a very large amount of data, it can be desirable to |
|---|
| 323 | index the data in smaller chunks (perhaps on separate machines), and then |
|---|
| 324 | merge the chunks together into a single database. This can also be performed |
|---|
| 325 | using the "xapian-compact" tool, simply by supplying |
|---|
| 326 | several source database paths. |
|---|
| 327 | |
|---|
| 328 | Normally, merging works by reading the source databases in parallel, and |
|---|
| 329 | writing the contents in sorted order to the destination database. This will |
|---|
| 330 | work most efficiently if excessive disk seeking can be avoided; if you have |
|---|
| 331 | several disks, it may be worth placing the source databases and the |
|---|
| 332 | destination database on separate disks to obtain maximum speed. |
|---|
| 333 | |
|---|
| 334 | The ``xapian-compact`` tool supports an additional option, ``--multipass``, |
|---|
| 335 | which is useful when merging more than three databases. This will cause the |
|---|
| 336 | postlist tables to be grouped and merged into temporary tables, which are then |
|---|
| 337 | grouped and merged, and so on until a single postlist table is created, which |
|---|
| 338 | is usually faster, but requires more disk space for the temporary files. |
|---|
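A merge of several chunks might be wrapped as follows. ``merge_databases`` is a hypothetical helper, not part of Xapian. It takes the destination first so that any number of sources can follow, passes the sources to ``xapian-compact`` ahead of the destination (the argument order this sketch assumes), and only adds ``--multipass`` when there are more than three sources, following the guidance above.

```shell
# Merge several source databases into one destination database.
# Usage: merge_databases DESTINATION SOURCE...
merge_databases() {
    dest=$1
    shift
    if [ "$#" -gt 3 ]; then
        # More than three sources: merge the postlist tables in several passes.
        xapian-compact --multipass "$@" "$dest"
    else
        xapian-compact "$@" "$dest"
    fi
}

# Example (not run here):
# merge_databases merged chunk1 chunk2 chunk3 chunk4
```

Remember that the multipass merge needs extra disk space for its temporary tables, so check the free space on the destination before starting a large merge.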
| 339 | |
|---|
| 340 | |
|---|
| 341 | Checking database integrity |
|---|
| 342 | --------------------------- |
|---|
| 343 | |
|---|
| 344 | Xapian includes a command-line tool to check that a flint database is |
|---|
| 345 | self-consistent. This tool, "xapian-check", runs through the entire database, |
|---|
| 346 | checking that all the internal nodes are correctly connected. It can also be |
|---|
| 347 | used on a single table in a flint database, by specifying the prefix of the |
|---|
| 348 | table: for example, this command will check the termlist table of database "foo":: |
|---|
| 349 | |
|---|
| 350 | xapian-check foo/termlist |
|---|
| 351 | |
|---|
| 352 | |
|---|
| 353 | Converting a quartz database to a flint database |
|---|
| 354 | ------------------------------------------------ |
|---|
| 355 | |
|---|
| 356 | It is possible to convert a quartz database to a flint database using the |
|---|
| 357 | "copydatabase" example program included with Xapian. This is a lot slower to |
|---|
| 358 | run than "quartzcompact" or "xapian-compact", since it has to perform the |
|---|
| 359 | sorting of the term occurrence data from scratch, but should be faster than a |
|---|
| 360 | re-index from source data since it doesn't need to perform the tokenisation |
|---|
| 361 | step. It is also useful if you no longer have the source data available. |
|---|
| 362 | |
|---|
| 363 | The following command will copy a database from "SOURCE" to "DESTINATION", |
|---|
| 364 | creating the new database at "DESTINATION" as a flint database:: |
|---|
| 365 | |
|---|
| 366 | copydatabase SOURCE DESTINATION |
|---|
| 367 | |
|---|
| 368 | |
|---|
| 369 | Converting a 0.9.x flint database to work with 1.0.y |
|---|
| 370 | ---------------------------------------------------- |
|---|
| 371 | |
|---|
| 372 | Due to a bug in the flint position list encoding in 0.9.x which made flint |
|---|
| 373 | databases non-portable between platforms, we had to make an incompatible |
|---|
| 374 | change in the flint format. It's not easy to write an upgrader, but you |
|---|
| 375 | can convert a database using the following procedure (although it might |
|---|
| 376 | be better to rebuild from scratch if you want to use the new UTF-8 support |
|---|
| 377 | in Xapian::QueryParser, Xapian::Stem, and Xapian::TermGenerator). |
|---|
| 378 | |
|---|
| 379 | Run the following command in your Xapian 0.9.x installation to copy your |
|---|
| 380 | 0.9.x flint database "SOURCE" to a new quartz database "INTERMEDIATE":: |
|---|
| 381 | |
|---|
| 382 | copydatabase SOURCE INTERMEDIATE |
|---|
| 383 | |
|---|
| 384 | Then run the following command in your Xapian 1.0.y installation to copy |
|---|
| 385 | your quartz database to a 1.0.y flint database "DESTINATION":: |
|---|
| 386 | |
|---|
| 387 | copydatabase INTERMEDIATE DESTINATION |
|---|