root / tags / 1.0.8 / xapian-core / docs / quartzdesign.html

Revision 9545, 30.0 kB (checked in by olly, 14 months ago)

docs/quartzdesign.html: Note that Quartz is now deprecated.

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
Line 
1<HTML>
2<HEAD>
3<TITLE>The Quartz Database Backend</TITLE>
4</HEAD>
5<BODY BGCOLOR="white" TEXT="black">
6
7<h1><center>The Quartz database backend</center></h1>
8
9<p>
10Note: use of Quartz is now deprecated in favour of the newer Flint disk-based
11format.  We plan to entirely remove the quartz backend in Xapian 1.1.0.
12However a lot of this document is also relevant to Flint as well.
13</p>
14
15<h2>Why Quartz?</h2>
16
17<em>
18What is this thing called Quartz?  How does it fit in with the Xapian
19library?</em>
20<p>
21Xapian can access information stored in various different formats.
22Generally these are disk-based, but there's also the InMemory format which is
23stored entirely in memory.
24
25<p>
26Each of these formats is comprised by a set of classes providing an interface
27to a Database object and several other related objects (PostList, TermList,
28etc...).
29
30<p>
31Quartz is simply the name of Xapian's first high-performance backend.
32The design of Quartz draws on all our past experience to satisfy the following
33criteria:
34<ul>
35<li>
36  Fast and scalable searches.
37</li>
38<li>
39  May be updated (ie, database doesn't have to be built from scratch in order
40  to make a single change).
41</li>
42<li>
43  May be modified whilst searches are in progress.
44</li>
45<li>
46  Provides atomic updates in the face of interruption at any point.
47</li>
48<li>
49  Provides a single-writer, multiple-reader environment.
50</li>
51</ul>
52
53<p>
54Different backends can be optionally compiled into the Xapian library
55(by specifying appropriate options to the configure script).  Quartz
56is compiled by default.
57
58<p>
59<em>
60Why do we call it Quartz - where does the name come from?
61</em>
62<p>Well, we had to call it something, and Quartz was
63simply the first name we came up with which we thought we could live with...
64
65<h2>Tables</h2>
66
67A Quartz database consists of several tables, each of which stores a
68different type of information: for example, one table stores the user-defined
69data associated with each document, and another table stores the posting
70lists (the lists of documents which particular terms occur in).
71<p>
72These tables consist of a set of key-tag pairs, which I shall often
73refer to these as <em>items</em> or <em>entries</em>.  Items may
74be accessed randomly by specifying a key and reading the item
75pointed to, or in sorted order by creating a cursor pointing to a
76particular item.  The sort order is a lexicographical ordering based
77on the contents of the keys.  Only one instance of a key may exist in
78a single table - inserting a second item with the same key as an existing
79item will overwrite the existing item.
80<p>
81Positioning of cursors may be performed even when a
82full key isn't known, by attempting to access an item which doesn't
83exist: the cursor will then be set to point to the first item with a
84key before that requested.
85<p>
86The Btree class defines the standard interface to a table.  This has
87a subclass for each table - QuartzRecordTable, QuartzValueTable,
88QuartzPostListTable, QuartzPositionListTable, and QuartzTermListTable.
89Apart from QuartzPostListTable, these are fairly thin wrappers.
90QuartzPostListTable buffers up the inverted changes internally to
91allow fast updatig.
92<p>
93Changes are made to the Btree by calling add() and del(), but they will not be
94seen by readers until commit() is called.  Alternatively, calling cancel() will
95abandon changes.  This allows atomic transactions to be implemented.
96<p>
97The Btree class is optimised to be fast when changes are applied in sorted
98order.  For most tables, this means indexing documents in docid order.
99QuartzPostListTable takes care of this as part of the inversion process.
100
101<h2>The contents of the tables</h2>
102
103We shall worry about the implementation of the tables later, but first
104we shall look at what is stored within each table.
105<p>
106There are five tables comprising a quartz database.
107<ul>
108<li>
109<B>Record</B>.
110This stores the arbitrary chunk of data associated with each document.
111<p>
112Key: lsb ... msb of the docid, until all remaining bytes are zero
113<p>
114The record table also holds a couple of special values, stored under the key
115consisting of a single zero byte (this isn't a valid encoded docid).  The first
116value is the next
117document ID to use when adding a document (document IDs are allocated in
118increasing order, starting at 1, and are currently never reused).  The other
119value is the total length of the documents in the database, which
120is used to calculate the average document length, which we need to
121calculate normalised document lengths.
122<p>
123</li><li>
124<B>Value</B>.
125This stores a set of values for each document
126<p>
127Key: lsb ... msb of the docid, until all remaining bytes are zero
128<p>
129Currently, there is one B-tree entry for each document in the database
130that has one or more values associated with it.  This entry
131consists of a list of value_no-s and values for that document.
132<p>
133An
134alternative implementation would be to store an item for each value, whose
135key is a combination of the document ID and the keyno, and whose tag is the
136value.
137Which implementation is better depends on the access pattern: if a document
138is being passed across a network link, all the values for a document
139are read - if a document is being dealt with locally, usually only some of
140the values will be read.
141<p>
142Documents will usually have very few values, so the current
143implementation may actually be the most suitable.
144<p>
145</li>
146<li>
147<b>TermList</b>.
148This stores the list of terms which appear in a document.
149<p>
150Key: lsb ... msb of the docid, until all remaining bytes are zero
151<p>
152The list first stores the document length, and the number of entries in the
153termlist (this latter value is stored for quick access - it could also be
154determined by running through the termlist).  It then stores a set of
155entries: each entry in the list consists of a term (as a string), and the
156wdf (within document frequency - how many times the term appears in the
157document) of that term.
158<p>
159In a non-modifiable database, the term frequency could be stored in the
160termlist for each entry in each list.  This would enable query expansion
161operations to occur significantly faster by avoiding the need for a large
162number of extra lookups - however, this cannot be implemented in a
163writable database without causing any modifications to modify a very large
164proportion of the database.
165<p>
166</li><li>
167<b>PositionList</b>.
168For each (term,&nbsp;document) pair, this stores the list of positions in the
169document at which the term occurs.
170<p>
171Key: pack_uint(did) + tname
172<p>
173</li><li>
174<b>PostList</b>.
175This stores the list of documents in which each term appears.  Each entry
176in the table is a chunk of entries in the posting list for the term.
177<p>
178Key: pack_string_preserving_sort(tname) [first chunk]
179<br>
180Key: pack_string_preserving_sort(tname) + pack_uint_preserving_sort(first_did_in_chunk) [second and subsequent chunks]
181<p>
182</li>
183</ul>
184
185<h2>Representation of integers, strings, etc</h2>
186
187It is well known that in modern computers there are many, many CPU cycles
188for each disk read, or even memory read.  It is therefore important to
189minimise disk reads, and can be advantageous to do so even at the expense of
190a large amount of computation.  In other words, <em>Compression is
191good</em>.
192<p>
193The current implementation uses simple compression - we're investigating
194more effective schemes - these are (FIXME: this is slightly out of date
195now):
196<ul>
197<li>
198In posting lists, successive document IDs are stored as a difference
199which is compressed using a byte-wise huffman encoding (so it's stored
200in 7, 14, 21, 28, ... bits):
201<ol>
202<li>
203First byte: if integer is &lt; 128, store integer, otherwise store
204integer modulo 128, but with top bit set.
205</li><li>
206Shift integer right 7 places.
207</li><li>
208Second byte: if integer is &lt; 128, store integer, otherwise store
209integer modulo 128, but with top bit set.
210</li><li>
211Shift integer right 7 places.
212</li><li>
213etc...
214</li>
215</ol>
216<p>
217</li><li>
218In position lists, successive positions are encoded similarly.
219<p>
220</li><li>
221In termlists, terms are stored as string values in sorted order.  The term
222names are compressed by storing differences between consecutive terms.
223<p>
224</li>
225</ul>
226
227<h2>PostLists and chunks</h2>
228
229Posting lists can grow to be very large - some terms occur in a very large
230proportion of the documents, and their posting lists can represent a
231significant fraction of the size of the whole database.  Therefore, we do
232not wish to read an entire posting list into memory at once.  (Indeed, we'd
233rather only read a small portion of it at all, but that's a different
234story - see the documentation on <A HREF="matcherdesign.html">optimisations
235performed by the matcher</A>).
236<p>
237To deal with this, we store posting lists in small chunks, each the right
238size to be stored in a single B-tree block, and hence to be accessed with a
239minimal amount of disk latency.
240<p>
241The key for the first chunk in a posting list is the term name of the term
242whose posting list it is.  The key in subsequent chunks is the term name
243followed by the document ID of the first document in the chunk.  This
244allows the cursor methods to be used to scan through the chunks in order,
245and also to jump to the chunk containing a particular document ID.
246
247<p>
248It is quite possible that the termlists and
249position lists would benefit from being split into chunks in this way.
250
251<h2>All document lists</h2>
252
253It is possible to use the Xapian API to obtain a list of all documents in the
254database.  This is done by creating a special postinglist.  This functionality
255was added after the file structure in use by Quartz was frozen, and it is
256unfortunately impossible to implement efficiently for Quartz.
257
258The problem is that it is not possible to read the list of documents in sorted
259order direct from disk - instead, the list is read into memory to be sorted.
260For databases which do not have sparse document IDs, this should not use much
261memory since the list is kept in memory in a range-compressed form (but does
262require an iteration over the entirety of one of the tables of the Quartz
263database - no skipping can be done in this case.)  This is unlikely to be
264fixed, since we don't believe it can be without changing Quartz's structure.
265In any case, it is not a priority since Quartz is no longer the default
266backend.
267
268<h2>Btree implementation</h2>
269
270The tables are currently all implemented as B-trees (actually a form of
271B-tree sometimes known as a B+ tree).
272(For some tables, the use of a different structure could be more appropriate -
273perhaps a hashing scheme might provide more space and time efficient access.
274This is an area for future investigation).
275<p>
276A B-tree is a fairly standard structure for storing this kind of data, so I
277will not describe it in detail - see a reference book on database design and
278algorithms for that.  The essential points are that it is a block-based
279multiply branching tree structure, storing keys in the internal blocks and
280key-tag pairs in the leaf blocks.
281<p>
282Our implementation is fairly standard, except for its revision scheme,
283which allows modifications to be applied atomically whilst other processes
284are reading the database.  This scheme involves copying each block in the
285tree which is involved in a modification, rather than modifying it in
286place, so that a complete new tree structure is built up whilst the old
287structure is unmodified (although this new structure will typically share a
288large number of blocks with the old structure).  The modifications can then
289be atomically applied by writing the new root block and making it active.
290<p>
291After a modification is applied successfully, the old version of the
292table is still fully intact, and can be accessed.  The old version only
293becomes invalid when a second modification is attempted (and it becomes
294invalid whether or not that second modification succeeds).
295<p>
296There is no need for a process which is writing the database to know
297whether any processes are reading previous versions of the database.  As long
298as only one update is performed before the reader closes (or reopens) the
299database, no problem will occur.  If more than one update occurs whilst
300the table is still open, the reader will notice that the database has been
301changed whilst it has been reading it by comparing a revision number stored
302at the start of each block with the revision number it was expecting.  An
303appropriate action can then be taken (for example, to reopen the database
304and repeat the operation).
305<p>
306An alternative approach would be to obtain a read-lock on the revision
307being accessed.  A write would then have to wait until no read-locks
308existed on the old revision before modifying the database.
309
310<h2>Applying changes to all the tables simultaneously</h2>
311
312To recap, we have tables storing key/tag pairs, we can update these,
313and we can then call a
314method and have all the modifications applied to the table atomically.
315Unfortunately, we need more than that - we need to be able to apply
316modifications as a single atomic transaction across multiple tables, so
317that the tables are always accessed in a mutually consistent state.
318<p>
319The revisioning scheme described earlier comes to the rescue!  By carefully
320making sure that we open all the tables at the same revision, and by ensuring
321that at least one such consistent revision always exists, we can extend the
322scope of atomicity to cover all the tables.  In detail:
323
324<ul>
325<li>
326When opening a database, we open each table in a specified order: lets call
327the tables <em>A</em>, <em>B</em>, <em>C</em>, <em>D</em>, <em>E</em>,
328and say we open them in alphabetical order.
329</li><li>
330When opening a database, after opening the first table, <em>A</em>, at the
331newest available revision, we read its revision number.  We then open all
332the other tables at the same revision number.
333</li><li>
334When we commit changes to a database, we commit the tables in reverse order
335- <em>E</em> first, and <em>A</em> last.
336</li>
337</ul>
338
339This scheme guarantees that modifications are atomic across all the tables
340- essentially we have made the modification get committed only when the
341final table is committed.
342
343<h2>Items to be added to this document</h2>
344
345<ul>
346<li>
347Describe that postlists must be stored in sorted order, for boolean queries,
348so cannot store in reverse wdf order for efficiency.  A possible workaround is
349to store the postlists in two or more chunks, ordered by wdf, and to access
350them in this order.
351</li>
352<li>
353An better explanation of why there will always be a consistent set of table
354versions using the scheme described above.
355</li>
356<li>
357Mention that future versions will allow the database creator to decide
358whether to store certain levels of detail in the database - eg, whether to
359store document lengths, term frequencies in the termlists, document lengths
360in the posting lists, etc.
361</li>
362<li>
363Add comment about the inversion process - we are essentially doing a text
364based partitioning scheme.
365</li>
366<li>
367Mention adding documents: usual to add with a new document ID which is
368greater than any currently in the system.  This means that new postings
369get added to the end of posting lists (so we make it easy to get to the
370end of a posting list quickly).  Also, we key position lists by document
371ID first and termname second, so that new positionlists get added to the
372end of the positionlist database, meaning that hardly any blocks will need
373to be altered in this database: data just gets added to the end.
374</li>
375</ul>
376
377<h2>Endnote</h2>
378
379The system as described could, no doubt, be improved in several ways.
380If you can think of such ways then suggest it to us,
381so we can have a discussion of the improvement to see whether it would
382help: if it would we will add it to the design (and eventually the
383code) - if not, we'll add a discussion about it to this document.
384
385<h1><center>The Btree Implementation</center></h1>
386
387I'm not sure about the name 'Btree' that runs through all this, since the fact
388that it is all implemented as a B-tree is surely irrelevant. I have not been
389able to think of a better name though ...
390
391<P>
392Some of the constants mentioned below depend upon a byte being 8 bits, but this
393assumption is not built into the code.
394
395<H2>Keys and tags</H2>
396
397Thinking of 'byte' having type 'unsigned char', a key and a tag are both
398sequences of bytes. The B-tree is a repository for key-tag pairs. A key can be
399looked up to find its corresponding tag. If a key is deleted, the corresponding
400tag is deleted. And in the B-tree keys are unique, so if a key-tag pair is
401added in and the key is already in the Btree, the tag to be added in replaces
402the tag in the B-tree.
403
404<P>In the B-tree key-tag pairs are ordered, and the order is the ASCII collating
405order of the keys. Very precisely, if key1 and key2 point to keys with lengths
406key1_len, key2_len, key1 is before/equal/after key2 according as the following
407procedure returns a value less than, equal to or greater than 0,
408
409<pre>
410static int compare_keys(const byte * key1, int key1_len,
411                        const byte * key2, int key2_len)
412{
413    int smaller = key1_len &lt; key2_len ? key1_len : key2_len;
414    for (int i = 0; i &lt; smaller; i++) {
415        int diff = (int) key1[i] - key2[i];
416        if (diff != 0) return diff;
417    }
418    return key1_len - key2_len;
419}
420</pre>
421
422<P>[This is okay, but none of the code fragments below have been checked.]
423
424<P>
425Any large-scale operation on the B-tree will run very much faster when the keys
426have been sorted into ASCII collating order. This fact is critical to the
427performance of the B-tree software.
428
429<P>
430A key-tag pair is called an 'item'. The B-tree consists therefore of a list of
431items, ordered by their keys:
432
433<pre>
434    I<sub>1</sub>  I<sub>2</sub>  ...  I<sub>j-1</sub>  I<sub>j</sub>  I<sub>j+1</sub>  ...  I<sub>n-1</sub>  I<sub>n</sub>
435</pre>
436
437<P>
438Item I<sub>j</sub> has a 'previous' item, I<sub>j-1</sub>, and a 'next' item, I<sub>j+1</sub>.
439
440<P>
441When the B-tree is created, a single item is added in with null key and null
442tag. This is the 'null item'. The null item may be searched for, and it's
443possible, although perhaps not useful, to replace the tag part of the null
444item. But the null item cannot be deleted, and an attempt to do so is merely
445ignored.
446
447<P>
448A key must not exceed 252 bytes in length.
449
450<P>
451A tag may have length zero. There is an upper limit on the length of a tag, but
452it is quite high. Roughly, the tag is divided into items of size L - kl, where
453L is a a few bytes less than a quarter of the block size, and kl is length of
454its key. You can then have 64K such items. So even with a block size as low as
4552K and key length as large as 100, you could have a tag of 2.5 megabytes. More
456realistically, with a 16K block size, the upper limit on the tag size is about
457256 megabytes.
458
459<H2>Revision numbers</H2>
460
461<P>The B-tree has a revision number, and each time it is updated, the revision
462number increases. In a single transaction on the B-tree, it is first opened,
463its revision number, R is found, updates are made, and then the B-tree is
464closed with a supplied revision number. The supplied revision number will
465typically be R+1, but any R+k is possible, where k &gt; 0.
466
467<P>
468If this sequence fails to complete for some reason, revision R+k of the B-tree
469will not, of course, be brought into existence. But revision R will still
470exist, and it is that version of the B-tree that will be the starting point for
471later revisions.
472
473<P>
474If this sequence runs to a successful termination, the new revision, R+k,
475supplants the old revision, R. But it is still possible to open the B-tree at
476revision R. After a successful revision of the B-tree, in fact, it will have
477two valid versions: the current one, revision R+k, and the old one, revision R.
478
479<P>
480You might want to go back to the old revision of a B-tree if it is being
481updated in tandem with second B-tree, and the update on the second B-tree
482fails. Suppose B1 and B2 are two such B-trees. B1 is opened and its latest
483revision number is found to be R1. B2 is opened and its latest revision number
484is found to be R2. If R1 &gt; R2, it must be the case that the previous
485transaction on B1 succeeded and the previous transaction on B2 failed. Then B1
486needs to opened at its previous revision number, which must be R1.
487
488<P>
489The calls using revision numbers described below are intended to handle this
490type of contingency.
491
492<H2>The files</H2>
493
494The B-tree has three associated files. DB contains the data proper of the
495B-tree. The revision numbers, other administrative information, and a bitmap
496are held in two files, baseA and baseB.
497
498<P>
499When the B-tree is opened without any particular
500revision number being specified, the later of baseA and baseB is chosen as the
501opening base, and as soon as a write to the file DB occurs, the earlier of
502baseA or baseB is deleted. On closure, the new revision number is written to
503baseB if baseA was the opening base, and to baseA if baseB was the opening
504base. If the B-tree update fails for some reason, only one base will usually
505survive.
506
507<P>The bitmap stored in each base file will have bit n set if block n is in use
508in the corresponding revision of the B-tree.
509
510<H2>The API</H2>
511
512See the <a href="http://www.xapian.org/docs/sourcedoc/html/classBtree.html">doxygen generated documentation</a> for a description of the API of the Btree
513class and related classes.
514
515<h2>Checking the B-tree</h2>
516
517The following static method is provided in btreecheck.h:
518
519<pre>
520void BtreeCheck::check(const string &amp; name, int opts);
521</pre>
522<!-- FIXME there's an optional ostream & argument (default output is to cout) -->
523
524<blockquote>
525    BtreeCheck::check(s, opts) is essentially equivalent to
526
527<pre>
528        Btree B(s, false);
529        B.open();
530        {
531            // do a complete integrity check of the B-tree,
532            // reporting according to the bitmask opts
533        }
534</pre>
535
536    The option bitmask may consist of any of the following values |-ed together:
537<ul>
538  <li> OPT_SHORT_TREE - short summary of entire B-tree
539  <li> OPT_FULL_TREE - full summary of entire B-tree
540  <li> OPT_SHOW_BITMAP - print the bitmap
541  <li> OPT_SHOW_STATS - print the basic information (revision number, blocksize etc.)
542</ul>
543
544    The options control what is reported - the entire B-tree is always checked
545    as well as reporting the information.
546    <!-- keep this for quartzcheck docs:
547    if non-null, causes information to go to stdout. The
548    following characters may appear in the option string:
549
550<pre>
551        t   - short summary of entire B-tree
552        f   - full summary of entire B-tree
553        b   - print the bitmap
554        v   - print the basic information (revision number, blocksize etc.)
555        +   - equivalent to tbv
556        ?   - lists currently available options
557</pre>
558
559    The options cause a side-effect of printing, so Btree_check(s, "v") checks
560    the entire B-tree and reports basic information, rather than merely
561    reporting the basic information.
562    -->
563</blockquote>
564
565<h2>Full compaction</h2>
566
567As the B-tree grows, items are added into blocks. When a block is full, it
568splits into two (amoeba-like) and one of the new blocks accommodates the new
569entry. Blocks are therefore between 50% and 100% full during growth, or 75% full
570on average.
571
572<p>
573Let us say an item is 'new' if it is presented for addition to the B-tree and
574its key is not already in the B-tree. Then presenting a long run of new items
575ordered by key causes the B-tree updating process to switch into a mode where
576much higher compaction than 75% is achieved - about 90%. This is called
577'sequential' mode. It is possible to force an even higher compaction rate with
578the procedure
579
580
581<pre>
582void Btree::full_compaction(bool parity);
583</pre>
584
585So
586
587<pre>
588    B.full_compaction(true);
589</pre>
590
591switches full compaction on, and
592
593<pre>
594    B.full_compaction(false);
595</pre>
596
597switches it off. Full compaction may be switched on or off at any time, but
598it only affects the compaction rate of sequential mode. In sequential mode, full
599compaction gives around 98-99% block usage - it is not quite 100% because keys
600are not split across blocks.
601
602<p>
603The downside of full compaction is that block splitting will be heavy on the
604next update. However, if a B-tree is created with no intention of being updated,
605full compaction is very desirable.
606
607<h2>Full compaction with revision 1</h2>
608
609Retrieval mode is faster when the B-tree has revision number 1 than for higher
610revision numbers. This is because there are no unused blocks in the B-tree and
611the blocks are in a special order, and this enables the Bcursor::prev and
612Bcursor::next procedures, and the other procedures which use them implicitly,
613to have more efficient forms.
614
615<p>
616To make a really fast structure for retrieval therefore, create a new B-tree,
617open it for updating, set full compaction mode, and add all the items in a
618single transaction, sorted on keys. After closing, do not update further.
619<!--
620Further updates can be prevented quite easily by deleting (or moving) the bitmap
621files. These are required in update mode but ignored in retrieval mode.
622-->
623
624<p>
625Xapian includes a utility which performs this process on all the Btrees
626in a quartz database - it's call quartzcompact. You can
627refer to the <a href="http://www.xapian.org/docs/sourcedoc/html/quartzcompact_8cc-source.html">source code of the quartzcompact utility</a> to see how this
628is implemented.
629
630<h2>quartzcompact</h2>
631
632quartzcompact takes two arguments - the path of the database to compact, and
633a path to write the compacted version to.
634
635<p>
636Only the Btree structure is changed - all the keys and tags are unaltered, so
637the database is the same as far as an application using Xapian is concerned.
638In particular, all the document ids are the same.
639
640<h2>Notes on space requirements</h2>
641
642The level of the B-tree is the distance of the root block from a leaf block. At
643minimum this is zero. If a B-tree has level L and block size B, then update
644mode requires space for 2(LB + b<sub>1</sub> + b<sub>2</sub>) bytes,
645where b<sub>1</sub> and b<sub>2</sub> are the size of
646the two bitmap files. Of course, L, b<sub>1</sub> and b<sub>2</sub>
647may grow during an update on the
648B-tree. If the revision number is greater than one, then retrieval mode
649requires (L - 2 + 2c)B bytes, where c is the number of active cursors. If
650however the revision number is one, it only requires (L - 2 + c)B bytes.
651
652<p>
653This may change in the future with code redesign, but meanwhile note that a K
654term query that needs k &lt;= K cursors open at once to process, will demand
6552*K*B bytes of memory in the B-tree manager.
656
657<h2>Updating during retrieval</h2>
658
659The B-tree cannot be updated by two separate processes at the same time. The
660user of the B-tree software should establish a locking mechanism to ensure that
661this never happens.
662
663<p>
664It is possible to do retrieval while the B-tree is being updated. If the
665updating process overwrites a part of the B-tree required by the retrieval
666process, then a Xapian::DatabaseModifiedError exception is thrown.
667
668<p>
669This should be handled, and suitable action taken - either the operation
670aborted, or the Btree reopened at the latest revision and the operation
671retried.  Here is a model scheme:
672
673<pre>
674static Btree * reopen(Btree * B)
675{
676    // Get the revision number. This will return the correct value, even when
677    // B-&gt;overwritten is detected during opening.
678    uint4 revision = B-&gt;get_open_revision_number();
679
680    while (true) {
681        try {
682            delete B;  /* close the B-tree */
683            B = new Btree(s, true);
684            B-&gt;open(s); /* and reopen */
685            break;
686        } catch (const Xapian::DatabaseModifiedError &amp;) {
687        }
688    }
689
690    if (revision == B-&gt;get_open_revision_number()) {
691        // The revision number ought to have gone up from last time,
692        // so if we arrive here, something has gone badly wrong ...
693        printf("Possible database corruption!\n");
694        exit(1);
695    }
696    return B;
697}
698
699
700    ....
701
702    char * s = "database/";
703    Btree * B = 0;
704    uint4 revision = 0;
705
706    /* open the B-tree */
707    while (true) {
708        try {
709            delete B;  /* close the B-tree */
710            B = new Btree(s, true);
711            B-&gt;open(); /* and reopen */
712            break;
713        } catch (const Xapian::DatabaseModifiedError &amp;) {
714        }
715    }
716
717    string t;
718    while (true) {
719        try {
720            B-&gt;find_tag("brunel", &amp;t); /* look up some keyword */
721            break;
722        } catch (const Xapian::DatabaseModifiedError &amp;) {
723            B = reopen(s);
724        }
725    }
726
727    ...
728</pre>
729
730If the overwritten condition were detected in updating mode, this would mean
731that there were two updating processes at work, or the database has become
732corrupted somehow.  If this is detected, a Xapian::DatabaseCorruptError is
733thrown.  There's not much which can usefully be done to automatically handle
734this condition.
735
736<p>
737In retrieval mode, the following can cause Xapian::DatabaseModifiedError to
738be thrown:
739
740<pre>
741    Btree::open_to_read(name);
742    Btree::open_to_read(name, revision);
743    Bcursor::next();
744    Bcursor::prev();
745    Bcursor::find_key(const string &amp;key);
746    Bcursor::get_tag(string * tag);
747</pre>
748
749The following can not:
750
751<pre>
752   Bcursor::Bcursor(Btree * B);
753   Bcursor::get_key(string * key);
754</pre>
755
756Note particularly that opening the B-tree can cause it, but
757Bcursor::get_key(..) can't.
758
759<!--
760<h2>Error conditions</h2>
761
762Note: this section is out of date - many methods now return errors in different
763ways to those this section indicates.
764
765<P>
766The procedures described above report errors in two ways. (A) A non-zero
767result. Btree::close() returns an int result which is 0 if successful,
768otherwise an error number. (B) The error is placed in B-&gt;error, where B is the
769Btree structure used in the call, or the Btree structure from which the Bcursor
770structure used in the call derives. Then B-&gt;error == 0 means no error,
771otherwise it is a positive number (greater than 2) giving the error number.
772
773<p>
774Some procedures cannot give an error.  Here is a summary:
775
776<pre>
777    Error method  procedure
778    (A)(B)          error condition given by:
779    ===================================================
780        *  n = Btree::find_key(key)
781        *  n = Btree::find_tag(key, kt)
782        *  n = Btree::add(key, tag)
783        *  n = Btree::del(key)
784        *  B = Btree::open_to_write(s)
785        *  B = Btree::open_to_write(s, rev)
786     *     n = Btree::close(B, rev)
787     *     n = Btree::create(s, block_size)   [throws exception]
788        *  B = Btree::open_to_read(s)
789        *  B = Btree::open_to_read(s, rev)
790               Bcursor::Bcursor()
791        *  n = Bcursor::find_key(key)
792        *  n = Bcursor::next()
793        *  n = Bcursor::prev()
794        *  n = Bcursor::get_key(kt)
795        *  n = Bcursor::get_tag(kt)
796               Btree::full_compaction(parity)
797
798    (A) non-zero result (B) B.error == true
799</pre>
800-->
801<!-- FOOTER $Author$ $Date$ $Id$ -->
802</BODY>
803</HTML>
Note: See TracBrowser for help on using the browser.