.. This document was originally written by Richard Boulton, with funding
.. provided by Enfold Systems.

.. Copyright (C) 2006 Lemur Consulting Ltd
.. Copyright (C) 2007 Olly Betts

============================
Xapian Administrator's Guide
============================

.. contents:: Table of contents

Introduction
============

This document is intended to provide general hints, tips and advice to
administrators of Xapian systems.  It assumes that you have installed Xapian
on your system, and are familiar with the basics of creating and searching
Xapian databases.

The intended audience is system administrators who need to be able to perform
general management of a Xapian database, including tasks such as taking
backups and optimising performance.  It may also be useful introductory
reading for Xapian application developers.

The document is up-to-date for Xapian version 1.0.5.

Databases
=========

Xapian databases hold all the information needed to perform searches in a set
of tables.  The following tables always exist:

 - A posting list table, which holds a list of all the documents indexed by
   each term in the database.
 - A record table, which holds the document data associated with each document
   in the database.
 - A termlist table, which holds a list of all the terms which index each
   document.

And the following optional tables exist only when there is data to store in
them (in 1.0.1 and earlier, the position and value tables were always created
even if empty; spelling and synonym tables are new in 1.0.2):

 - A position list table, which holds a list of the word positions at which
   each term occurs in each document.
 - A value table, which holds the "values" (used for sorting, collapsing, and
   other match-time calculations) associated with each document in the
   database.
 - A spelling table, which holds data for suggesting spelling corrections.
 - A synonym table, which holds a synonym dictionary.

Each of the tables is held in a separate file, allowing an administrator to
see how much data is being used for each of the above purposes.  It is not
always necessary to fully populate these tables: for example, if phrase
searches are never going to be performed on the database, it is not necessary
to store any positionlist information.

If you look at a Xapian database, you will see that each of these tables
actually uses 2 or 3 files.  For example, for a "flint" format database the
termlist table is stored in the files "termlist.baseA", "termlist.baseB"
and "termlist.DB".

Of these files, only the ".DB" file actually stores the data.  The ".baseA"
and ".baseB" files are used to keep track of where to start looking for that
data in the ".DB" file.  Often, only one of the ".baseA" and ".baseB" files
will be present; each of these files refers to a revision of the database, and
there may be more than one valid revision of the database stored in the ".DB"
file at once.
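
As an illustration of how the per-table files can be used to see where the
space is going, here is a minimal sketch in Python.  It is not part of Xapian:
the function name ``table_sizes`` is hypothetical, and it assumes only the
".DB", ".baseA" and ".baseB" naming described above.

```python
import os
from collections import defaultdict


def table_sizes(db_dir):
    """Return {table_name: total_bytes} for files named like 'termlist.DB'."""
    sizes = defaultdict(int)
    for name in os.listdir(db_dir):
        path = os.path.join(db_dir, name)
        table, dot, suffix = name.partition(".")
        # Only count the per-table files described above; other files in the
        # directory (such as the lock file) are ignored.
        if dot and suffix in ("DB", "baseA", "baseB") and os.path.isfile(path):
            sizes[table] += os.path.getsize(path)
    return dict(sizes)
```

Calling ``table_sizes("foo")`` on a database directory "foo" would return one
entry per table that is present, so a missing optional table simply does not
appear in the result.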

The ".DB" file is structured as a set of blocks, which have a default size of
8KB (though this can be set, either through the Xapian API, or with some of
the tools detailed later in this document).  The first block is used for
header information, so a ".DB" file with only a single entry will be 16KB in
size.
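
The arithmetic behind that figure can be sketched as follows.  This is only a
worked example: the function name is hypothetical, and it assumes that a file
consists of exactly one header block plus a whole number of data blocks.

```python
DEFAULT_BLOCK_SIZE = 8 * 1024  # the 8KB default block size mentioned above


def db_file_size(data_blocks, block_size=DEFAULT_BLOCK_SIZE):
    """Size in bytes of a ".DB" file: one header block plus the data blocks."""
    return (1 + data_blocks) * block_size


print(db_file_size(1))  # 16384, i.e. 16KB for a single data block
```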

Changing the blocksize may have performance implications, but it is hard to
tell whether these will be positive or negative for a particular combination
of hardware and software without doing some profiling.

Atomic modifications
--------------------

Xapian ensures that all modifications to its database are performed
atomically.  This means that:

 - From the point of view of a separate process (or a separate database object
   in the same process) reading the database, all modifications made to a
   database are invisible until the modifications are committed.
 - The database on disk is always in a consistent state.
 - If the system is interrupted during a modification, the database should
   always be left in a valid state.  This applies even if the power is cut
   unexpectedly, as long as the disk does not become corrupted due to hardware
   failure.

Committing a modification requires several calls to the operating system to
make it flush any cached modifications to the database to disk.  This is to
ensure that if the system fails at any point, the database is left in a
consistent state.  Of course, this is a fairly slow process (since the system
has to wait for the disk to physically write the data), so grouping many
changes together will speed up the throughput considerably.

Many modifications can be explicitly grouped into a single transaction, so
that lots of changes are applied at once.  Even if an application doesn't
explicitly protect modifications to the database using transactions, Xapian
will group modifications into transactions, applying the modifications in
batches.

Note that it is not currently possible to extend Xapian's transactions to
cover multiple databases, or to link them with transactions in external
systems, such as an RDBMS.

Finally, note that it is possible to compile Xapian such that it doesn't make
modifications in an atomic manner, in order to build very large databases more
quickly (search the Xapian mailing list archives for "DANGEROUS" mode for more
details).  This isn't yet integrated into standard builds of Xapian, but may
be in future, if appropriate protections can be incorporated.

Single writer, multiple reader
------------------------------

Xapian implements a "single writer, multiple reader" model.  This means that,
at any given instant, only a single object is permitted to modify a database,
but many objects may be reading the database at the same time.

Xapian enforces this restriction using lock-files.  For a flint database, each
Xapian database directory contains a lock file named ``flintlock``.  The
lock-file will always exist, but will be locked using ``fcntl()`` when the
database is open for writing.  If a writer exits without being given a
chance to clean up (for example, if the application holding the writer
is killed), the ``fcntl()`` lock will be automatically released by the operating
system.  Under Microsoft Windows, we use a different locking technique, but
with the same features.
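
To illustrate how an ``fcntl()`` lock behaves on a Unix-like system, here is a
minimal sketch using Python's standard ``fcntl`` module.  It is not Xapian's
own locking code, and the function name ``try_write_lock`` is hypothetical; it
is best tried on a scratch file rather than on the lock file of a live
database.

```python
import fcntl
import os


def try_write_lock(path):
    """Try to take an exclusive fcntl() lock on path without blocking.

    Returns the open file descriptor on success (keep it open to hold the
    lock), or None if another process already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # lockf() is implemented with fcntl() record locks.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd
```

The lock belongs to the process, so it disappears when the process exits for
any reason, which is the property described above.  Note that a second attempt
from within the same process will succeed, because ``fcntl()`` locks only
conflict between different processes.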

Revision numbers
----------------

Xapian databases contain a revision number.  This is essentially a count of
the number of modifications since the database was created, and is needed to
implement the atomic modification functionality.  It is stored as a 32 bit
integer, so there is a chance that a very frequently updated database could
cause this to overflow.  The consequence of such an overflow would be to throw
database errors.

This isn't likely to be a practical problem, since it would take on the order
of a year for a database to reach this limit if 100 modifications were
committed every second, and no normal Xapian system will commit more than once
every few seconds.  However, if you are concerned, you can use the
``xapian-compact`` tool to make a fresh copy of the database with the revision
number set to 1.
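
The time taken to reach the limit can be estimated with a short calculation.
This is only a sketch: the function name is hypothetical, and since the text
above does not say whether the 32 bit integer is signed, both cases are shown.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(commits_per_second, bits=32):
    """Days until a counter with the given number of bits wraps around."""
    return (2 ** bits) / commits_per_second / SECONDS_PER_DAY


print(round(days_until_overflow(100)))        # 497 (all 32 bits usable)
print(round(days_until_overflow(100, 31)))    # 249 (if one bit is a sign bit)
print(round(days_until_overflow(0.2) / 365))  # 681 years at one commit per 5s
```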

For a "flint" database, the revision number of each table can be displayed by
the ``xapian-check`` tool.

Network file systems
--------------------

Xapian should work correctly over a network file system.  However, there are a
large number of potential issues with such file systems, so we recommend
extensive testing of your particular network file system before deployment.

Be warned that Xapian is heavily I/O dependent, and therefore performance over
a network file system is likely to be slow unless you've got a very well tuned
setup.

Xapian needs to be able to create a lock file in a database directory when
modifications are being performed.  On some network file systems (e.g., NFS)
this requires a lock daemon to be running.

Which database format to use?
-----------------------------

As of release 1.0.0, you should use the flint format (which is now the
default).  The quartz format is now deprecated and support is scheduled
for removal in 1.1.0.

Can I put other files in the database directory?
------------------------------------------------

If you wish to store meta-data or other information relating to the Xapian
database, it is reasonable to wish to put this in files inside the Xapian
database directory, for neatness.  For example, you might wish to store a list
of the prefixes you've applied to terms for specific fields in the database.

Xapian's "flint" backend doesn't perform any operations which will break this
technique, so as long as you don't choose a filename that Xapian uses itself,
there should be no problems.  However, be aware that new versions of Xapian
may use new files in the database directory, and it is also possible that new
backend formats may not be compatible with the technique (e.g., it is possible
that a future backend could store its entire database in a single file, not in
a directory).


Backup Strategies
=================

Summary
-------

 - The simplest way to perform a backup is to temporarily halt modifications,
   take a copy of all files in the database directory, and then allow
   modifications to resume.  Read access can continue while a backup is being
   taken.

 - If you have a filesystem which allows atomic snapshots to be taken of
   directories (for example, a filesystem on an LVM volume), an alternative
   strategy is to take a snapshot and simply copy all the files in the
   database directory to the backup medium.  Such a copy will always be a
   valid database.

 - Progressive backups are not easily possible; modifications are typically
   spread throughout the database files.

Detail
------

Even though Xapian databases are often automatically generated from source
data which is stored in a reliable manner, it is usually desirable to keep
backups of Xapian databases being run in production environments.  This is
particularly important in systems with high-availability requirements, since
re-building a Xapian database from scratch can take many hours.  It is also
important in the case where the data stored in the database cannot easily be
recovered from external sources.

Xapian databases are managed such that at any instant in time, there is at
least one valid revision of the database written to disk (and if there are
multiple valid revisions, Xapian will always open the most recent).
Therefore, if it is possible to take an instantaneous snapshot of all the
database files (for example, of a filesystem on an LVM volume), this snapshot
is suitable for copying to a backup medium.  Note that it is not sufficient to
take a snapshot of each database file in turn - the snapshot must be across
all database files simultaneously.  Otherwise, there is a risk that the
snapshot could contain database files from different revisions.

If it is not possible to take an instantaneous snapshot, the best backup
strategy is simply to ensure that no modifications are committed during the
backup procedure.  While the simplest way to implement this may be to stop
whatever processes are used to modify the database, and ensure that they close
the database, it is not actually necessary to ensure that no writers are open
on the database; it is enough to ensure that no writer makes any modification
to the database.
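
The copying step of this strategy can be sketched in Python as follows.  This
is a minimal illustration, not a complete backup tool: the function name
``backup_database`` is hypothetical, and the code assumes that the caller has
already arranged for no modifications to be committed while it runs.

```python
import os
import shutil


def backup_database(db_dir, backup_dir):
    """Copy every file in db_dir to a new directory backup_dir.

    Assumes the caller has already made sure no modifications are committed
    while the copy runs; this function does not enforce that.
    """
    # copytree() refuses to overwrite an existing directory, which guards
    # against mixing files from two different backups.
    shutil.copytree(db_dir, backup_dir)
    return sorted(os.listdir(backup_dir))
```

The returned list of file names can be compared with the contents of the
source directory as a basic check that nothing was missed.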

Because a Xapian database can contain more than one valid revision of the
database, it is actually possible to allow a limited number of modifications
to be performed while a backup copy is being made, but this is tricky and we
do not recommend relying on it.  Future versions of Xapian are likely to
support this better, by allowing the current revision of a database to be
preserved while modifications continue.

Progressive backups are not recommended for Xapian databases: Xapian database
files are block-structured, and modifications are spread throughout the
database file.  Therefore, a progressive backup tool will not be able to take
a backup by storing only the new parts of the database.  Modifications will
normally be so extensive that most parts of the database have been modified;
however, if only a small number of modifications have been made, a binary diff
algorithm might make a usable progressive backup tool.


Inspecting a database
=====================

When designing an indexing strategy, it is often useful to be able to check
the contents of the database.  Xapian includes a simple command-line program,
"delve", to allow this.

For example, to display the list of terms in document "1" of the database
"foo", use::

  delve foo -r 1

It is also possible to perform simple searches of a database.  Xapian includes
another simple command-line program, "quest", to support this.  "quest" is
only able to search for un-prefixed terms, and the query string must be quoted
to protect it from the shell.  To search the database "foo" for the phrase
"hello world", use::

  quest -d foo '"hello world"'

If you have installed the "Omega" CGI application built on Xapian, this can
also be used with the built-in "godmode" template to provide a web-based
interface for browsing a database.  See Omega's documentation for more details
on this.

Database maintenance
====================

Compacting a database
---------------------

Xapian databases normally have some spare space in each block to allow
new information to be efficiently slotted into the database.  However, the
smaller a database is, the faster it can be searched, so if there aren't
expected to be many further modifications, it can be desirable to compact the
database.

Xapian includes a tool, "xapian-compact", for compacting "flint" format
databases.  This tool makes a copy of a database, and takes advantage of the
sorted nature of the source Xapian database to write the database out without
leaving so much space for future modifications.  This can result in a large
space saving.

The downside of compaction is that future modifications may take a little
longer, due to needing to reorganise the database to make space for them.
However, modifications are still possible, and if many modifications are made,
the database will eventually adjust itself.

The tool has an option ("-F") to perform a "fuller" compaction.  This option
compacts the database as much as possible, but it violates the design of the
Btree format slightly to achieve this, so it is not recommended if further
modifications are at all likely in future.  If you do need to modify a "fuller"
compacted database, we recommend you run xapian-compact on it without "-F"
first.

While taking a copy of the database, it is also possible to change the
blocksize.  If you wish to profile search speed with different blocksizes,
this is the recommended way to generate the different databases (but remember
to compact the original database as well, for a fair comparison).


Merging databases
-----------------

When building an index for a very large amount of data, it can be desirable to
index the data in smaller chunks (perhaps on separate machines), and then
merge the chunks together into a single database.  This can also be performed
using the "xapian-compact" tool, simply by supplying several source database
paths.

Normally, merging works by reading the source databases in parallel, and
writing the contents in sorted order to the destination database.  This will
work most efficiently if excessive disk seeking can be avoided; if you have
several disks, it may be worth placing the source databases and the
destination database on separate disks to obtain maximum speed.

The ``xapian-compact`` tool supports an additional option, ``--multipass``,
which is useful when merging more than three databases.  This will cause the
postlist tables to be grouped and merged into temporary tables, which are then
grouped and merged, and so on until a single postlist table is created.  This
is usually faster, but requires more disk space for the temporary files.
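
The difference between the two merging strategies can be illustrated with
plain sorted lists standing in for postlist tables.  This is a conceptual
sketch only, not the code used by ``xapian-compact``: the function names are
hypothetical, the group size of 3 is an assumption, and a real merge also has
to combine the entries for terms which occur in more than one source.

```python
import heapq


def merge_sorted(sources):
    """Merge already-sorted sequences into one sorted list in a single pass."""
    return list(heapq.merge(*sources))


def multipass_merge(sources, group_size=3):
    """Merge in groups of group_size, repeating until one list remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    tables = [list(s) for s in sources]
    while len(tables) > 1:
        # Each pass merges neighbouring groups into temporary tables.
        tables = [
            merge_sorted(tables[i:i + group_size])
            for i in range(0, len(tables), group_size)
        ]
    return tables[0] if tables else []
```

Both functions produce the same result; the multipass version reads fewer
inputs at once in each pass, at the cost of holding the intermediate tables.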


Checking database integrity
---------------------------

Xapian includes a command-line tool to check that a flint database is
self-consistent.  This tool, "xapian-check", runs through the entire database,
checking that all the internal nodes are correctly connected.  It can also be
used on a single table in a flint database, by specifying the prefix of the
table: for example, this command will check the termlist table of database "foo"::

  xapian-check foo/termlist


Converting a quartz database to a flint database
------------------------------------------------

It is possible to convert a quartz database to a flint database using the
"copydatabase" example program included with Xapian.  This is a lot slower to
run than "quartzcompact" or "xapian-compact", since it has to perform the
sorting of the term occurrence data from scratch, but should be faster than a
re-index from source data since it doesn't need to perform the tokenisation
step.  It is also useful if you no longer have the source data available.

The following command will copy a database from "SOURCE" to "DESTINATION",
creating the new database at "DESTINATION" as a flint database::

  copydatabase SOURCE DESTINATION


Converting a 0.9.x flint database to work with 1.0.y
----------------------------------------------------

Due to a bug in the flint position list encoding in 0.9.x which made flint
databases non-portable between platforms, we had to make an incompatible
change in the flint format.  It's not easy to write an upgrader, but you
can convert a database using the following procedure (although it might
be better to rebuild from scratch if you want to use the new UTF-8 support
in Xapian::QueryParser, Xapian::Stem, and Xapian::TermGenerator).

Run the following command in your Xapian 0.9.x installation to copy your
0.9.x flint database "SOURCE" to a new quartz database "INTERMEDIATE"::

  copydatabase SOURCE INTERMEDIATE

Then run the following command in your Xapian 1.0.y installation to copy
your quartz database to a 1.0.y flint database "DESTINATION"::

  copydatabase INTERMEDIATE DESTINATION