| 1 | <HTML> |
|---|
| 2 | <HEAD> |
|---|
| 3 | <TITLE>The Quartz Database Backend</TITLE> |
|---|
| 4 | </HEAD> |
|---|
| 5 | <BODY BGCOLOR="white" TEXT="black"> |
|---|
| 6 | |
|---|
| 7 | <h1><center>The Quartz database backend</center></h1> |
|---|
| 8 | |
|---|
| 9 | <p> |
|---|
| 10 | Note: use of Quartz is now deprecated in favour of the newer Flint disk-based |
|---|
| 11 | format. We plan to entirely remove the quartz backend in Xapian 1.1.0. |
|---|
| 12 | However a lot of this document is also relevant to Flint as well. |
|---|
| 13 | </p> |
|---|
| 14 | |
|---|
| 15 | <h2>Why Quartz?</h2> |
|---|
| 16 | |
|---|
| 17 | <em> |
|---|
| 18 | What is this thing called Quartz? How does it fit in with the Xapian |
|---|
| 19 | library?</em> |
|---|
| 20 | <p> |
|---|
| 21 | Xapian can access information stored in various different formats. |
|---|
| 22 | Generally these are disk-based, but there's also the InMemory format which is |
|---|
| 23 | stored entirely in memory. |
|---|
| 24 | |
|---|
| 25 | <p> |
|---|
| 26 | Each of these formats is comprised by a set of classes providing an interface |
|---|
| 27 | to a Database object and several other related objects (PostList, TermList, |
|---|
| 28 | etc...). |
|---|
| 29 | |
|---|
| 30 | <p> |
|---|
| 31 | Quartz is simply the name of Xapian's first high-performance backend. |
|---|
| 32 | The design of Quartz draws on all our past experience to satisfy the following |
|---|
| 33 | criteria: |
|---|
| 34 | <ul> |
|---|
| 35 | <li> |
|---|
| 36 | Fast and scalable searches. |
|---|
| 37 | </li> |
|---|
| 38 | <li> |
|---|
| 39 | May be updated (ie, database doesn't have to be built from scratch in order |
|---|
| 40 | to make a single change). |
|---|
| 41 | </li> |
|---|
| 42 | <li> |
|---|
| 43 | May be modified whilst searches are in progress. |
|---|
| 44 | </li> |
|---|
| 45 | <li> |
|---|
| 46 | Provides atomic updates in the face of interruption at any point. |
|---|
| 47 | </li> |
|---|
| 48 | <li> |
|---|
| 49 | Provides a single-writer, multiple-reader environment. |
|---|
| 50 | </li> |
|---|
| 51 | </ul> |
|---|
| 52 | |
|---|
| 53 | <p> |
|---|
| 54 | Different backends can be optionally compiled into the Xapian library |
|---|
| 55 | (by specifying appropriate options to the configure script). Quartz |
|---|
| 56 | is compiled by default. |
|---|
| 57 | |
|---|
| 58 | <p> |
|---|
| 59 | <em> |
|---|
| 60 | Why do we call it Quartz - where does the name come from? |
|---|
| 61 | </em> |
|---|
| 62 | <p>Well, we had to call it something, and Quartz was |
|---|
| 63 | simply the first name we came up with which we thought we could live with... |
|---|
| 64 | |
|---|
| 65 | <h2>Tables</h2> |
|---|
| 66 | |
|---|
| 67 | A Quartz database consists of several tables, each of which stores a |
|---|
| 68 | different type of information: for example, one table stores the user-defined |
|---|
| 69 | data associated with each document, and another table stores the posting |
|---|
| 70 | lists (the lists of documents which particular terms occur in). |
|---|
| 71 | <p> |
|---|
| 72 | These tables consist of a set of key-tag pairs, which I shall often |
|---|
| 73 | refer to these as <em>items</em> or <em>entries</em>. Items may |
|---|
| 74 | be accessed randomly by specifying a key and reading the item |
|---|
| 75 | pointed to, or in sorted order by creating a cursor pointing to a |
|---|
| 76 | particular item. The sort order is a lexicographical ordering based |
|---|
| 77 | on the contents of the keys. Only one instance of a key may exist in |
|---|
| 78 | a single table - inserting a second item with the same key as an existing |
|---|
| 79 | item will overwrite the existing item. |
|---|
| 80 | <p> |
|---|
| 81 | Positioning of cursors may be performed even when a |
|---|
| 82 | full key isn't known, by attempting to access an item which doesn't |
|---|
| 83 | exist: the cursor will then be set to point to the first item with a |
|---|
| 84 | key before that requested. |
|---|
| 85 | <p> |
|---|
| 86 | The Btree class defines the standard interface to a table. This has |
|---|
| 87 | a subclass for each table - QuartzRecordTable, QuartzValueTable, |
|---|
| 88 | QuartzPostListTable, QuartzPositionListTable, and QuartzTermListTable. |
|---|
| 89 | Apart from QuartzPostListTable, these are fairly thin wrappers. |
|---|
| 90 | QuartzPostListTable buffers up the inverted changes internally to |
|---|
| 91 | allow fast updatig. |
|---|
| 92 | <p> |
|---|
| 93 | Changes are made to the Btree by calling add() and del(), but they will not be |
|---|
| 94 | seen by readers until commit() is called. Alternatively, calling cancel() will |
|---|
| 95 | abandon changes. This allows atomic transactions to be implemented. |
|---|
| 96 | <p> |
|---|
| 97 | The Btree class is optimised to be fast when changes are applied in sorted |
|---|
| 98 | order. For most tables, this means indexing documents in docid order. |
|---|
| 99 | QuartzPostListTable takes care of this as part of the inversion process. |
|---|
| 100 | |
|---|
| 101 | <h2>The contents of the tables</h2> |
|---|
| 102 | |
|---|
| 103 | We shall worry about the implementation of the tables later, but first |
|---|
| 104 | we shall look at what is stored within each table. |
|---|
| 105 | <p> |
|---|
| 106 | There are five tables comprising a quartz database. |
|---|
| 107 | <ul> |
|---|
| 108 | <li> |
|---|
| 109 | <B>Record</B>. |
|---|
| 110 | This stores the arbitrary chunk of data associated with each document. |
|---|
| 111 | <p> |
|---|
| 112 | Key: lsb ... msb of the docid, until all remaining bytes are zero |
|---|
| 113 | <p> |
|---|
| 114 | The record table also holds a couple of special values, stored under the key |
|---|
| 115 | consisting of a single zero byte (this isn't a valid encoded docid). The first |
|---|
| 116 | value is the next |
|---|
| 117 | document ID to use when adding a document (document IDs are allocated in |
|---|
| 118 | increasing order, starting at 1, and are currently never reused). The other |
|---|
| 119 | value is the total length of the documents in the database, which |
|---|
| 120 | is used to calculate the average document length, which we need to |
|---|
| 121 | calculate normalised document lengths. |
|---|
| 122 | <p> |
|---|
| 123 | </li><li> |
|---|
| 124 | <B>Value</B>. |
|---|
| 125 | This stores a set of values for each document |
|---|
| 126 | <p> |
|---|
| 127 | Key: lsb ... msb of the docid, until all remaining bytes are zero |
|---|
| 128 | <p> |
|---|
| 129 | Currently, there is one B-tree entry for each document in the database |
|---|
| 130 | that has one or more values associated with it. This entry |
|---|
| 131 | consists of a list of value_no-s and values for that document. |
|---|
| 132 | <p> |
|---|
| 133 | An |
|---|
| 134 | alternative implementation would be to store an item for each value, whose |
|---|
| 135 | key is a combination of the document ID and the keyno, and whose tag is the |
|---|
| 136 | value. |
|---|
| 137 | Which implementation is better depends on the access pattern: if a document |
|---|
| 138 | is being passed across a network link, all the values for a document |
|---|
| 139 | are read - if a document is being dealt with locally, usually only some of |
|---|
| 140 | the values will be read. |
|---|
| 141 | <p> |
|---|
| 142 | Documents will usually have very few values, so the current |
|---|
| 143 | implementation may actually be the most suitable. |
|---|
| 144 | <p> |
|---|
| 145 | </li> |
|---|
| 146 | <li> |
|---|
| 147 | <b>TermList</b>. |
|---|
| 148 | This stores the list of terms which appear in a document. |
|---|
| 149 | <p> |
|---|
| 150 | Key: lsb ... msb of the docid, until all remaining bytes are zero |
|---|
| 151 | <p> |
|---|
| 152 | The list first stores the document length, and the number of entries in the |
|---|
| 153 | termlist (this latter value is stored for quick access - it could also be |
|---|
| 154 | determined by running through the termlist). It then stores a set of |
|---|
| 155 | entries: each entry in the list consists of a term (as a string), and the |
|---|
| 156 | wdf (within document frequency - how many times the term appears in the |
|---|
| 157 | document) of that term. |
|---|
| 158 | <p> |
|---|
| 159 | In a non-modifiable database, the term frequency could be stored in the |
|---|
| 160 | termlist for each entry in each list. This would enable query expansion |
|---|
| 161 | operations to occur significantly faster by avoiding the need for a large |
|---|
| 162 | number of extra lookups - however, this cannot be implemented in a |
|---|
| 163 | writable database without causing any modifications to modify a very large |
|---|
| 164 | proportion of the database. |
|---|
| 165 | <p> |
|---|
| 166 | </li><li> |
|---|
| 167 | <b>PositionList</b>. |
|---|
| 168 | For each (term, document) pair, this stores the list of positions in the |
|---|
| 169 | document at which the term occurs. |
|---|
| 170 | <p> |
|---|
| 171 | Key: pack_uint(did) + tname |
|---|
| 172 | <p> |
|---|
| 173 | </li><li> |
|---|
| 174 | <b>PostList</b>. |
|---|
| 175 | This stores the list of documents in which each term appears. Each entry |
|---|
| 176 | in the table is a chunk of entries in the posting list for the term. |
|---|
| 177 | <p> |
|---|
| 178 | Key: pack_string_preserving_sort(tname) [first chunk] |
|---|
| 179 | <br> |
|---|
| 180 | Key: pack_string_preserving_sort(tname) + pack_uint_preserving_sort(first_did_in_chunk) [second and subsequent chunks] |
|---|
| 181 | <p> |
|---|
| 182 | </li> |
|---|
| 183 | </ul> |
|---|
| 184 | |
|---|
| 185 | <h2>Representation of integers, strings, etc</h2> |
|---|
| 186 | |
|---|
| 187 | It is well known that in modern computers there are many, many CPU cycles |
|---|
| 188 | for each disk read, or even memory read. It is therefore important to |
|---|
| 189 | minimise disk reads, and can be advantageous to do so even at the expense of |
|---|
| 190 | a large amount of computation. In other words, <em>Compression is |
|---|
| 191 | good</em>. |
|---|
| 192 | <p> |
|---|
| 193 | The current implementation uses simple compression - we're investigating |
|---|
| 194 | more effective schemes - these are (FIXME: this is slightly out of date |
|---|
| 195 | now): |
|---|
| 196 | <ul> |
|---|
| 197 | <li> |
|---|
| 198 | In posting lists, successive document IDs are stored as a difference |
|---|
| 199 | which is compressed using a byte-wise huffman encoding (so it's stored |
|---|
| 200 | in 7, 14, 21, 28, ... bits): |
|---|
| 201 | <ol> |
|---|
| 202 | <li> |
|---|
| 203 | First byte: if integer is < 128, store integer, otherwise store |
|---|
| 204 | integer modulo 128, but with top bit set. |
|---|
| 205 | </li><li> |
|---|
| 206 | Shift integer right 7 places. |
|---|
| 207 | </li><li> |
|---|
| 208 | Second byte: if integer is < 128, store integer, otherwise store |
|---|
| 209 | integer modulo 128, but with top bit set. |
|---|
| 210 | </li><li> |
|---|
| 211 | Shift integer right 7 places. |
|---|
| 212 | </li><li> |
|---|
| 213 | etc... |
|---|
| 214 | </li> |
|---|
| 215 | </ol> |
|---|
| 216 | <p> |
|---|
| 217 | </li><li> |
|---|
| 218 | In position lists, successive positions are encoded similarly. |
|---|
| 219 | <p> |
|---|
| 220 | </li><li> |
|---|
| 221 | In termlists, terms are stored as string values in sorted order. The term |
|---|
| 222 | names are compressed by storing differences between consecutive terms. |
|---|
| 223 | <p> |
|---|
| 224 | </li> |
|---|
| 225 | </ul> |
|---|
| 226 | |
|---|
| 227 | <h2>PostLists and chunks</h2> |
|---|
| 228 | |
|---|
| 229 | Posting lists can grow to be very large - some terms occur in a very large |
|---|
| 230 | proportion of the documents, and their posting lists can represent a |
|---|
| 231 | significant fraction of the size of the whole database. Therefore, we do |
|---|
| 232 | not wish to read an entire posting list into memory at once. (Indeed, we'd |
|---|
| 233 | rather only read a small portion of it at all, but that's a different |
|---|
| 234 | story - see the documentation on <A HREF="matcherdesign.html">optimisations |
|---|
| 235 | performed by the matcher</A>). |
|---|
| 236 | <p> |
|---|
| 237 | To deal with this, we store posting lists in small chunks, each the right |
|---|
| 238 | size to be stored in a single B-tree block, and hence to be accessed with a |
|---|
| 239 | minimal amount of disk latency. |
|---|
| 240 | <p> |
|---|
| 241 | The key for the first chunk in a posting list is the term name of the term |
|---|
| 242 | whose posting list it is. The key in subsequent chunks is the term name |
|---|
| 243 | followed by the document ID of the first document in the chunk. This |
|---|
| 244 | allows the cursor methods to be used to scan through the chunks in order, |
|---|
| 245 | and also to jump to the chunk containing a particular document ID. |
|---|
| 246 | |
|---|
| 247 | <p> |
|---|
| 248 | It is quite possible that the termlists and |
|---|
| 249 | position lists would benefit from being split into chunks in this way. |
|---|
| 250 | |
|---|
| 251 | <h2>All document lists</h2> |
|---|
| 252 | |
|---|
| 253 | It is possible to use the Xapian API to obtain a list of all documents in the |
|---|
| 254 | database. This is done by creating a special postinglist. This functionality |
|---|
| 255 | was added after the file structure in use by Quartz was frozen, and it is |
|---|
| 256 | unfortunately impossible to implement efficiently for Quartz. |
|---|
| 257 | |
|---|
| 258 | The problem is that it is not possible to read the list of documents in sorted |
|---|
| 259 | order direct from disk - instead, the list is read into memory to be sorted. |
|---|
| 260 | For databases which do not have sparse document IDs, this should not use much |
|---|
| 261 | memory since the list is kept in memory in a range-compressed form (but does |
|---|
| 262 | require an iteration over the entirety of one of the tables of the Quartz |
|---|
| 263 | database - no skipping can be done in this case.) This is unlikely to be |
|---|
| 264 | fixed, since we don't believe it can be without changing Quartz's structure. |
|---|
| 265 | In any case, it is not a priority since Quartz is no longer the default |
|---|
| 266 | backend. |
|---|
| 267 | |
|---|
| 268 | <h2>Btree implementation</h2> |
|---|
| 269 | |
|---|
| 270 | The tables are currently all implemented as B-trees (actually a form of |
|---|
| 271 | B-tree sometimes known as a B+ tree). |
|---|
| 272 | (For some tables, the use of a different structure could be more appropriate - |
|---|
| 273 | perhaps a hashing scheme might provide more space and time efficient access. |
|---|
| 274 | This is an area for future investigation). |
|---|
| 275 | <p> |
|---|
| 276 | A B-tree is a fairly standard structure for storing this kind of data, so I |
|---|
| 277 | will not describe it in detail - see a reference book on database design and |
|---|
| 278 | algorithms for that. The essential points are that it is a block-based |
|---|
| 279 | multiply branching tree structure, storing keys in the internal blocks and |
|---|
| 280 | key-tag pairs in the leaf blocks. |
|---|
| 281 | <p> |
|---|
| 282 | Our implementation is fairly standard, except for its revision scheme, |
|---|
| 283 | which allows modifications to be applied atomically whilst other processes |
|---|
| 284 | are reading the database. This scheme involves copying each block in the |
|---|
| 285 | tree which is involved in a modification, rather than modifying it in |
|---|
| 286 | place, so that a complete new tree structure is built up whilst the old |
|---|
| 287 | structure is unmodified (although this new structure will typically share a |
|---|
| 288 | large number of blocks with the old structure). The modifications can then |
|---|
| 289 | be atomically applied by writing the new root block and making it active. |
|---|
| 290 | <p> |
|---|
| 291 | After a modification is applied successfully, the old version of the |
|---|
| 292 | table is still fully intact, and can be accessed. The old version only |
|---|
| 293 | becomes invalid when a second modification is attempted (and it becomes |
|---|
| 294 | invalid whether or not that second modification succeeds). |
|---|
| 295 | <p> |
|---|
| 296 | There is no need for a process which is writing the database to know |
|---|
| 297 | whether any processes are reading previous versions of the database. As long |
|---|
| 298 | as only one update is performed before the reader closes (or reopens) the |
|---|
| 299 | database, no problem will occur. If more than one update occurs whilst |
|---|
| 300 | the table is still open, the reader will notice that the database has been |
|---|
| 301 | changed whilst it has been reading it by comparing a revision number stored |
|---|
| 302 | at the start of each block with the revision number it was expecting. An |
|---|
| 303 | appropriate action can then be taken (for example, to reopen the database |
|---|
| 304 | and repeat the operation). |
|---|
| 305 | <p> |
|---|
| 306 | An alternative approach would be to obtain a read-lock on the revision |
|---|
| 307 | being accessed. A write would then have to wait until no read-locks |
|---|
| 308 | existed on the old revision before modifying the database. |
|---|
| 309 | |
|---|
| 310 | <h2>Applying changes to all the tables simultaneously</h2> |
|---|
| 311 | |
|---|
| 312 | To recap, we have tables storing key/tag pairs, we can update these, |
|---|
| 313 | and we can then call a |
|---|
| 314 | method and have all the modifications applied to the table atomically. |
|---|
| 315 | Unfortunately, we need more than that - we need to be able to apply |
|---|
| 316 | modifications as a single atomic transaction across multiple tables, so |
|---|
| 317 | that the tables are always accessed in a mutually consistent state. |
|---|
| 318 | <p> |
|---|
| 319 | The revisioning scheme described earlier comes to the rescue! By carefully |
|---|
| 320 | making sure that we open all the tables at the same revision, and by ensuring |
|---|
| 321 | that at least one such consistent revision always exists, we can extend the |
|---|
| 322 | scope of atomicity to cover all the tables. In detail: |
|---|
| 323 | |
|---|
| 324 | <ul> |
|---|
| 325 | <li> |
|---|
| 326 | When opening a database, we open each table in a specified order: lets call |
|---|
| 327 | the tables <em>A</em>, <em>B</em>, <em>C</em>, <em>D</em>, <em>E</em>, |
|---|
| 328 | and say we open them in alphabetical order. |
|---|
| 329 | </li><li> |
|---|
| 330 | When opening a database, after opening the first table, <em>A</em>, at the |
|---|
| 331 | newest available revision, we read its revision number. We then open all |
|---|
| 332 | the other tables at the same revision number. |
|---|
| 333 | </li><li> |
|---|
| 334 | When we commit changes to a database, we commit the tables in reverse order |
|---|
| 335 | - <em>E</em> first, and <em>A</em> last. |
|---|
| 336 | </li> |
|---|
| 337 | </ul> |
|---|
| 338 | |
|---|
| 339 | This scheme guarantees that modifications are atomic across all the tables |
|---|
| 340 | - essentially we have made the modification get committed only when the |
|---|
| 341 | final table is committed. |
|---|
| 342 | |
|---|
| 343 | <h2>Items to be added to this document</h2> |
|---|
| 344 | |
|---|
| 345 | <ul> |
|---|
| 346 | <li> |
|---|
| 347 | Describe that postlists must be stored in sorted order, for boolean queries, |
|---|
| 348 | so cannot store in reverse wdf order for efficiency. A possible workaround is |
|---|
| 349 | to store the postlists in two or more chunks, ordered by wdf, and to access |
|---|
| 350 | them in this order. |
|---|
| 351 | </li> |
|---|
| 352 | <li> |
|---|
| 353 | An better explanation of why there will always be a consistent set of table |
|---|
| 354 | versions using the scheme described above. |
|---|
| 355 | </li> |
|---|
| 356 | <li> |
|---|
| 357 | Mention that future versions will allow the database creator to decide |
|---|
| 358 | whether to store certain levels of detail in the database - eg, whether to |
|---|
| 359 | store document lengths, term frequencies in the termlists, document lengths |
|---|
| 360 | in the posting lists, etc. |
|---|
| 361 | </li> |
|---|
| 362 | <li> |
|---|
| 363 | Add comment about the inversion process - we are essentially doing a text |
|---|
| 364 | based partitioning scheme. |
|---|
| 365 | </li> |
|---|
| 366 | <li> |
|---|
| 367 | Mention adding documents: usual to add with a new document ID which is |
|---|
| 368 | greater than any currently in the system. This means that new postings |
|---|
| 369 | get added to the end of posting lists (so we make it easy to get to the |
|---|
| 370 | end of a posting list quickly). Also, we key position lists by document |
|---|
| 371 | ID first and termname second, so that new positionlists get added to the |
|---|
| 372 | end of the positionlist database, meaning that hardly any blocks will need |
|---|
| 373 | to be altered in this database: data just gets added to the end. |
|---|
| 374 | </li> |
|---|
| 375 | </ul> |
|---|
| 376 | |
|---|
| 377 | <h2>Endnote</h2> |
|---|
| 378 | |
|---|
| 379 | The system as described could, no doubt, be improved in several ways. |
|---|
| 380 | If you can think of such ways then suggest it to us, |
|---|
| 381 | so we can have a discussion of the improvement to see whether it would |
|---|
| 382 | help: if it would we will add it to the design (and eventually the |
|---|
| 383 | code) - if not, we'll add a discussion about it to this document. |
|---|
| 384 | |
|---|
| 385 | <h1><center>The Btree Implementation</center></h1> |
|---|
| 386 | |
|---|
| 387 | I'm not sure about the name 'Btree' that runs through all this, since the fact |
|---|
| 388 | that it is all implemented as a B-tree is surely irrelevant. I have not been |
|---|
| 389 | able to think of a better name though ... |
|---|
| 390 | |
|---|
| 391 | <P> |
|---|
| 392 | Some of the constants mentioned below depend upon a byte being 8 bits, but this |
|---|
| 393 | assumption is not built into the code. |
|---|
| 394 | |
|---|
| 395 | <H2>Keys and tags</H2> |
|---|
| 396 | |
|---|
| 397 | Thinking of 'byte' having type 'unsigned char', a key and a tag are both |
|---|
| 398 | sequences of bytes. The B-tree is a repository for key-tag pairs. A key can be |
|---|
| 399 | looked up to find its corresponding tag. If a key is deleted, the corresponding |
|---|
| 400 | tag is deleted. And in the B-tree keys are unique, so if a key-tag pair is |
|---|
| 401 | added in and the key is already in the Btree, the tag to be added in replaces |
|---|
| 402 | the tag in the B-tree. |
|---|
| 403 | |
|---|
| 404 | <P>In the B-tree key-tag pairs are ordered, and the order is the ASCII collating |
|---|
| 405 | order of the keys. Very precisely, if key1 and key2 point to keys with lengths |
|---|
| 406 | key1_len, key2_len, key1 is before/equal/after key2 according as the following |
|---|
| 407 | procedure returns a value less than, equal to or greater than 0, |
|---|
| 408 | |
|---|
| 409 | <pre> |
|---|
| 410 | static int compare_keys(const byte * key1, int key1_len, |
|---|
| 411 | const byte * key2, int key2_len) |
|---|
| 412 | { |
|---|
| 413 | int smaller = key1_len < key2_len ? key1_len : key2_len; |
|---|
| 414 | for (int i = 0; i < smaller; i++) { |
|---|
| 415 | int diff = (int) key1[i] - key2[i]; |
|---|
| 416 | if (diff != 0) return diff; |
|---|
| 417 | } |
|---|
| 418 | return key1_len - key2_len; |
|---|
| 419 | } |
|---|
| 420 | </pre> |
|---|
| 421 | |
|---|
| 422 | <P>[This is okay, but none of the code fragments below have been checked.] |
|---|
| 423 | |
|---|
| 424 | <P> |
|---|
| 425 | Any large-scale operation on the B-tree will run very much faster when the keys |
|---|
| 426 | have been sorted into ASCII collating order. This fact is critical to the |
|---|
| 427 | performance of the B-tree software. |
|---|
| 428 | |
|---|
| 429 | <P> |
|---|
| 430 | A key-tag pair is called an 'item'. The B-tree consists therefore of a list of |
|---|
| 431 | items, ordered by their keys: |
|---|
| 432 | |
|---|
| 433 | <pre> |
|---|
| 434 | I<sub>1</sub> I<sub>2</sub> ... I<sub>j-1</sub> I<sub>j</sub> I<sub>j+1</sub> ... I<sub>n-1</sub> I<sub>n</sub> |
|---|
| 435 | </pre> |
|---|
| 436 | |
|---|
| 437 | <P> |
|---|
| 438 | Item I<sub>j</sub> has a 'previous' item, I<sub>j-1</sub>, and a 'next' item, I<sub>j+1</sub>. |
|---|
| 439 | |
|---|
| 440 | <P> |
|---|
| 441 | When the B-tree is created, a single item is added in with null key and null |
|---|
| 442 | tag. This is the 'null item'. The null item may be searched for, and it's |
|---|
| 443 | possible, although perhaps not useful, to replace the tag part of the null |
|---|
| 444 | item. But the null item cannot be deleted, and an attempt to do so is merely |
|---|
| 445 | ignored. |
|---|
| 446 | |
|---|
| 447 | <P> |
|---|
| 448 | A key must not exceed 252 bytes in length. |
|---|
| 449 | |
|---|
| 450 | <P> |
|---|
| 451 | A tag may have length zero. There is an upper limit on the length of a tag, but |
|---|
| 452 | it is quite high. Roughly, the tag is divided into items of size L - kl, where |
|---|
| 453 | L is a a few bytes less than a quarter of the block size, and kl is length of |
|---|
| 454 | its key. You can then have 64K such items. So even with a block size as low as |
|---|
| 455 | 2K and key length as large as 100, you could have a tag of 2.5 megabytes. More |
|---|
| 456 | realistically, with a 16K block size, the upper limit on the tag size is about |
|---|
| 457 | 256 megabytes. |
|---|
| 458 | |
|---|
| 459 | <H2>Revision numbers</H2> |
|---|
| 460 | |
|---|
| 461 | <P>The B-tree has a revision number, and each time it is updated, the revision |
|---|
| 462 | number increases. In a single transaction on the B-tree, it is first opened, |
|---|
| 463 | its revision number, R is found, updates are made, and then the B-tree is |
|---|
| 464 | closed with a supplied revision number. The supplied revision number will |
|---|
| 465 | typically be R+1, but any R+k is possible, where k > 0. |
|---|
| 466 | |
|---|
| 467 | <P> |
|---|
| 468 | If this sequence fails to complete for some reason, revision R+k of the B-tree |
|---|
| 469 | will not, of course, be brought into existence. But revision R will still |
|---|
| 470 | exist, and it is that version of the B-tree that will be the starting point for |
|---|
| 471 | later revisions. |
|---|
| 472 | |
|---|
| 473 | <P> |
|---|
| 474 | If this sequence runs to a successful termination, the new revision, R+k, |
|---|
| 475 | supplants the old revision, R. But it is still possible to open the B-tree at |
|---|
| 476 | revision R. After a successful revision of the B-tree, in fact, it will have |
|---|
| 477 | two valid versions: the current one, revision R+k, and the old one, revision R. |
|---|
| 478 | |
|---|
| 479 | <P> |
|---|
| 480 | You might want to go back to the old revision of a B-tree if it is being |
|---|
| 481 | updated in tandem with second B-tree, and the update on the second B-tree |
|---|
| 482 | fails. Suppose B1 and B2 are two such B-trees. B1 is opened and its latest |
|---|
| 483 | revision number is found to be R1. B2 is opened and its latest revision number |
|---|
| 484 | is found to be R2. If R1 > R2, it must be the case that the previous |
|---|
| 485 | transaction on B1 succeeded and the previous transaction on B2 failed. Then B1 |
|---|
| 486 | needs to opened at its previous revision number, which must be R1. |
|---|
| 487 | |
|---|
| 488 | <P> |
|---|
| 489 | The calls using revision numbers described below are intended to handle this |
|---|
| 490 | type of contingency. |
|---|
| 491 | |
|---|
| 492 | <H2>The files</H2> |
|---|
| 493 | |
|---|
| 494 | The B-tree has three associated files. DB contains the data proper of the |
|---|
| 495 | B-tree. The revision numbers, other administrative information, and a bitmap |
|---|
| 496 | are held in two files, baseA and baseB. |
|---|
| 497 | |
|---|
| 498 | <P> |
|---|
| 499 | When the B-tree is opened without any particular |
|---|
| 500 | revision number being specified, the later of baseA and baseB is chosen as the |
|---|
| 501 | opening base, and as soon as a write to the file DB occurs, the earlier of |
|---|
| 502 | baseA or baseB is deleted. On closure, the new revision number is written to |
|---|
| 503 | baseB if baseA was the opening base, and to baseA if baseB was the opening |
|---|
| 504 | base. If the B-tree update fails for some reason, only one base will usually |
|---|
| 505 | survive. |
|---|
| 506 | |
|---|
| 507 | <P>The bitmap stored in each base file will have bit n set if block n is in use |
|---|
| 508 | in the corresponding revision of the B-tree. |
|---|
| 509 | |
|---|
| 510 | <H2>The API</H2> |
|---|
| 511 | |
|---|
| 512 | See the <a href="http://www.xapian.org/docs/sourcedoc/html/classBtree.html">doxygen generated documentation</a> for a description of the API of the Btree |
|---|
| 513 | class and related classes. |
|---|
| 514 | |
|---|
| 515 | <h2>Checking the B-tree</h2> |
|---|
| 516 | |
|---|
| 517 | The following static method is provided in btreecheck.h: |
|---|
| 518 | |
|---|
| 519 | <pre> |
|---|
| 520 | void BtreeCheck::check(const string & name, int opts); |
|---|
| 521 | </pre> |
|---|
| 522 | <!-- FIXME there's an optional ostream & argument (default output is to cout) --> |
|---|
| 523 | |
|---|
| 524 | <blockquote> |
|---|
| 525 | BtreeCheck::check(s, opts) is essentially equivalent to |
|---|
| 526 | |
|---|
| 527 | <pre> |
|---|
| 528 | Btree B(s, false); |
|---|
| 529 | B.open(); |
|---|
| 530 | { |
|---|
| 531 | // do a complete integrity check of the B-tree, |
|---|
| 532 | // reporting according to the bitmask opts |
|---|
| 533 | } |
|---|
| 534 | </pre> |
|---|
| 535 | |
|---|
| 536 | The option bitmask may consist of any of the following values |-ed together: |
|---|
| 537 | <ul> |
|---|
| 538 | <li> OPT_SHORT_TREE - short summary of entire B-tree |
|---|
| 539 | <li> OPT_FULL_TREE - full summary of entire B-tree |
|---|
| 540 | <li> OPT_SHOW_BITMAP - print the bitmap |
|---|
| 541 | <li> OPT_SHOW_STATS - print the basic information (revision number, blocksize etc.) |
|---|
| 542 | </ul> |
|---|
| 543 | |
|---|
| 544 | The options control what is reported - the entire B-tree is always checked |
|---|
| 545 | as well as reporting the information. |
|---|
| 546 | <!-- keep this for quartzcheck docs: |
|---|
| 547 | if non-null, causes information to go to stdout. The |
|---|
| 548 | following characters may appear in the option string: |
|---|
| 549 | |
|---|
| 550 | <pre> |
|---|
| 551 | t - short summary of entire B-tree |
|---|
| 552 | f - full summary of entire B-tree |
|---|
| 553 | b - print the bitmap |
|---|
| 554 | v - print the basic information (revision number, blocksize etc.) |
|---|
| 555 | + - equivalent to tbv |
|---|
| 556 | ? - lists currently available options |
|---|
| 557 | </pre> |
|---|
| 558 | |
|---|
| 559 | The options cause a side-effect of printing, so Btree_check(s, "v") checks |
|---|
| 560 | the entire B-tree and reports basic information, rather than merely |
|---|
| 561 | reporting the basic information. |
|---|
| 562 | --> |
|---|
| 563 | </blockquote> |
|---|
| 564 | |
|---|
| 565 | <h2>Full compaction</h2> |
|---|
| 566 | |
|---|
| 567 | As the B-tree grows, items are added into blocks. When a block is full, it |
|---|
| 568 | splits into two (amoeba-like) and one of the new blocks accommodates the new |
|---|
| 569 | entry. Blocks are therefore between 50% and 100% full during growth, or 75% full |
|---|
| 570 | on average. |
|---|
| 571 | |
|---|
| 572 | <p> |
|---|
| 573 | Let us say an item is 'new' if it is presented for addition to the B-tree and |
|---|
| 574 | its key is not already in the B-tree. Then presenting a long run of new items |
|---|
| 575 | ordered by key causes the B-tree updating process to switch into a mode where |
|---|
| 576 | much higher compaction than 75% is achieved - about 90%. This is called |
|---|
| 577 | 'sequential' mode. It is possible to force an even higher compaction rate with |
|---|
| 578 | the procedure |
|---|
| 579 | |
|---|
| 580 | |
|---|
| 581 | <pre> |
|---|
| 582 | void Btree::full_compaction(bool parity); |
|---|
| 583 | </pre> |
|---|
| 584 | |
|---|
| 585 | So |
|---|
| 586 | |
|---|
| 587 | <pre> |
|---|
| 588 | B.full_compaction(true); |
|---|
| 589 | </pre> |
|---|
| 590 | |
|---|
| 591 | switches full compaction on, and |
|---|
| 592 | |
|---|
| 593 | <pre> |
|---|
| 594 | B.full_compaction(false); |
|---|
| 595 | </pre> |
|---|
| 596 | |
|---|
| 597 | switches it off. Full compaction may be switched on or off at any time, but |
|---|
| 598 | it only affects the compaction rate of sequential mode. In sequential mode, full |
|---|
| 599 | compaction gives around 98-99% block usage - it is not quite 100% because keys |
|---|
| 600 | are not split across blocks. |
|---|
| 601 | |
|---|
| 602 | <p> |
|---|
| 603 | The downside of full compaction is that block splitting will be heavy on the |
|---|
| 604 | next update. However, if a B-tree is created with no intention of being updated, |
|---|
| 605 | full compaction is very desirable. |
|---|
| 606 | |
|---|
| 607 | <h2>Full compaction with revision 1</h2> |
|---|
| 608 | |
|---|
| 609 | Retrieval mode is faster when the B-tree has revision number 1 than for higher |
|---|
| 610 | revision numbers. This is because there are no unused blocks in the B-tree and |
|---|
| 611 | the blocks are in a special order, and this enables the Bcursor::prev and |
|---|
| 612 | Bcursor::next procedures, and the other procedures which use them implicitly, |
|---|
| 613 | to have more efficient forms. |
|---|
| 614 | |
|---|
| 615 | <p> |
|---|
| 616 | To make a really fast structure for retrieval therefore, create a new B-tree, |
|---|
| 617 | open it for updating, set full compaction mode, and add all the items in a |
|---|
| 618 | single transaction, sorted on keys. After closing, do not update further. |
|---|
| 619 | <!-- |
|---|
| 620 | Further updates can be prevented quite easily by deleting (or moving) the bitmap |
|---|
| 621 | files. These are required in update mode but ignored in retrieval mode. |
|---|
| 622 | --> |
|---|
| 623 | |
|---|
| 624 | <p> |
|---|
| 625 | Xapian includes a utility which performs this process on all the Btrees |
|---|
| 626 | in a quartz database - it's call quartzcompact. You can |
|---|
| 627 | refer to the <a href="http://www.xapian.org/docs/sourcedoc/html/quartzcompact_8cc-source.html">source code of the quartzcompact utility</a> to see how this |
|---|
| 628 | is implemented. |
|---|
| 629 | |
|---|
| 630 | <h2>quartzcompact</h2> |
|---|
| 631 | |
|---|
| 632 | quartzcompact takes two arguments - the path of the database to compact, and |
|---|
| 633 | a path to write the compacted version to. |
|---|
| 634 | |
|---|
| 635 | <p> |
|---|
| 636 | Only the Btree structure is changed - all the keys and tags are unaltered, so |
|---|
| 637 | the database is the same as far as an application using Xapian is concerned. |
|---|
| 638 | In particular, all the document ids are the same. |
|---|
| 639 | |
|---|
| 640 | <h2>Notes on space requirements</h2> |
|---|
| 641 | |
|---|
| 642 | The level of the B-tree is the distance of the root block from a leaf block. At |
|---|
| 643 | minimum this is zero. If a B-tree has level L and block size B, then update |
|---|
| 644 | mode requires space for 2(LB + b<sub>1</sub> + b<sub>2</sub>) bytes, |
|---|
| 645 | where b<sub>1</sub> and b<sub>2</sub> are the size of |
|---|
| 646 | the two bitmap files. Of course, L, b<sub>1</sub> and b<sub>2</sub> |
|---|
| 647 | may grow during an update on the |
|---|
| 648 | B-tree. If the revision number is greater than one, then retrieval mode |
|---|
| 649 | requires (L - 2 + 2c)B bytes, where c is the number of active cursors. If |
|---|
| 650 | however the revision number is one, it only requires (L - 2 + c)B bytes. |
|---|
| 651 | |
|---|
| 652 | <p> |
|---|
| 653 | This may change in the future with code redesign, but meanwhile note that a K |
|---|
| 654 | term query that needs k <= K cursors open at once to process, will demand |
|---|
| 655 | 2*K*B bytes of memory in the B-tree manager. |
|---|
| 656 | |
|---|
| 657 | <h2>Updating during retrieval</h2> |
|---|
| 658 | |
|---|
| 659 | The B-tree cannot be updated by two separate processes at the same time. The |
|---|
| 660 | user of the B-tree software should establish a locking mechanism to ensure that |
|---|
| 661 | this never happens. |
|---|
| 662 | |
|---|
| 663 | <p> |
|---|
| 664 | It is possible to do retrieval while the B-tree is being updated. If the |
|---|
| 665 | updating process overwrites a part of the B-tree required by the retrieval |
|---|
| 666 | process, then a Xapian::DatabaseModifiedError exception is thrown. |
|---|
| 667 | |
|---|
| 668 | <p> |
|---|
| 669 | This should be handled, and suitable action taken - either the operation |
|---|
| 670 | aborted, or the Btree reopened at the latest revision and the operation |
|---|
| 671 | retried. Here is a model scheme: |
|---|
| 672 | |
|---|
| 673 | <pre> |
|---|
| 674 | static Btree * reopen(Btree * B) |
|---|
| 675 | { |
|---|
| 676 | // Get the revision number. This will return the correct value, even when |
|---|
| 677 | // B->overwritten is detected during opening. |
|---|
| 678 | uint4 revision = B->get_open_revision_number(); |
|---|
| 679 | |
|---|
| 680 | while (true) { |
|---|
| 681 | try { |
|---|
| 682 | delete B; /* close the B-tree */ |
|---|
| 683 | B = new Btree(s, true); |
|---|
| 684 | B->open(s); /* and reopen */ |
|---|
| 685 | break; |
|---|
| 686 | } catch (const Xapian::DatabaseModifiedError &) { |
|---|
| 687 | } |
|---|
| 688 | } |
|---|
| 689 | |
|---|
| 690 | if (revision == B->get_open_revision_number()) { |
|---|
| 691 | // The revision number ought to have gone up from last time, |
|---|
| 692 | // so if we arrive here, something has gone badly wrong ... |
|---|
| 693 | printf("Possible database corruption!\n"); |
|---|
| 694 | exit(1); |
|---|
| 695 | } |
|---|
| 696 | return B; |
|---|
| 697 | } |
|---|
| 698 | |
|---|
| 699 | |
|---|
| 700 | .... |
|---|
| 701 | |
|---|
| 702 | char * s = "database/"; |
|---|
| 703 | Btree * B = 0; |
|---|
| 704 | uint4 revision = 0; |
|---|
| 705 | |
|---|
| 706 | /* open the B-tree */ |
|---|
| 707 | while (true) { |
|---|
| 708 | try { |
|---|
| 709 | delete B; /* close the B-tree */ |
|---|
| 710 | B = new Btree(s, true); |
|---|
| 711 | B->open(); /* and reopen */ |
|---|
| 712 | break; |
|---|
| 713 | } catch (const Xapian::DatabaseModifiedError &) { |
|---|
| 714 | } |
|---|
| 715 | } |
|---|
| 716 | |
|---|
| 717 | string t; |
|---|
| 718 | while (true) { |
|---|
| 719 | try { |
|---|
| 720 | B->find_tag("brunel", &t); /* look up some keyword */ |
|---|
| 721 | break; |
|---|
| 722 | } catch (const Xapian::DatabaseModifiedError &) { |
|---|
| 723 | B = reopen(s); |
|---|
| 724 | } |
|---|
| 725 | } |
|---|
| 726 | |
|---|
| 727 | ... |
|---|
| 728 | </pre> |
|---|
| 729 | |
|---|
| 730 | If the overwritten condition were detected in updating mode, this would mean |
|---|
| 731 | that there were two updating processes at work, or the database has become |
|---|
| 732 | corrupted somehow. If this is detected, a Xapian::DatabaseCorruptError is |
|---|
| 733 | thrown. There's not much which can usefully be done to automatically handle |
|---|
| 734 | this condition. |
|---|
| 735 | |
|---|
| 736 | <p> |
|---|
| 737 | In retrieval mode, the following can cause Xapian::DatabaseModifiedError to |
|---|
| 738 | be thrown: |
|---|
| 739 | |
|---|
| 740 | <pre> |
|---|
| 741 | Btree::open_to_read(name); |
|---|
| 742 | Btree::open_to_read(name, revision); |
|---|
| 743 | Bcursor::next(); |
|---|
| 744 | Bcursor::prev(); |
|---|
| 745 | Bcursor::find_key(const string &key); |
|---|
| 746 | Bcursor::get_tag(string * tag); |
|---|
| 747 | </pre> |
|---|
| 748 | |
|---|
| 749 | The following can not: |
|---|
| 750 | |
|---|
| 751 | <pre> |
|---|
| 752 | Bcursor::Bcursor(Btree * B); |
|---|
| 753 | Bcursor::get_key(string * key); |
|---|
| 754 | </pre> |
|---|
| 755 | |
|---|
| 756 | Note particularly that opening the B-tree can cause it, but |
|---|
| 757 | Bcursor::get_key(..) can't. |
|---|
| 758 | |
|---|
| 759 | <!-- |
|---|
| 760 | <h2>Error conditions</h2> |
|---|
| 761 | |
|---|
| 762 | Note: this section is out of date - many methods now return errors in different |
|---|
| 763 | ways to those this section indicates. |
|---|
| 764 | |
|---|
| 765 | <P> |
|---|
| 766 | The procedures described above report errors in two ways. (A) A non-zero |
|---|
| 767 | result. Btree::close() returns an int result which is 0 if successful, |
|---|
| 768 | otherwise an error number. (B) The error is placed in B->error, where B is the |
|---|
| 769 | Btree structure used in the call, or the Btree structure from which the Bcursor |
|---|
| 770 | structure used in the call derives. Then B->error == 0 means no error, |
|---|
| 771 | otherwise it is a positive number (greater than 2) giving the error number. |
|---|
| 772 | |
|---|
| 773 | <p> |
|---|
| 774 | Some procedures cannot give an error. Here is a summary: |
|---|
| 775 | |
|---|
| 776 | <pre> |
|---|
| 777 | Error method procedure |
|---|
| 778 | (A)(B) error condition given by: |
|---|
| 779 | =================================================== |
|---|
| 780 | * n = Btree::find_key(key) |
|---|
| 781 | * n = Btree::find_tag(key, kt) |
|---|
| 782 | * n = Btree::add(key, tag) |
|---|
| 783 | * n = Btree::del(key) |
|---|
| 784 | * B = Btree::open_to_write(s) |
|---|
| 785 | * B = Btree::open_to_write(s, rev) |
|---|
| 786 | * n = Btree::close(B, rev) |
|---|
| 787 | * n = Btree::create(s, block_size) [throws exception] |
|---|
| 788 | * B = Btree::open_to_read(s) |
|---|
| 789 | * B = Btree::open_to_read(s, rev) |
|---|
| 790 | Bcursor::Bcursor() |
|---|
| 791 | * n = Bcursor::find_key(key) |
|---|
| 792 | * n = Bcursor::next() |
|---|
| 793 | * n = Bcursor::prev() |
|---|
| 794 | * n = Bcursor::get_key(kt) |
|---|
| 795 | * n = Bcursor::get_tag(kt) |
|---|
| 796 | Btree::full_compaction(parity) |
|---|
| 797 | |
|---|
| 798 | (A) non-zero result (B) B.error == true |
|---|
| 799 | </pre> |
|---|
| 800 | --> |
|---|
| 801 | <!-- FOOTER $Author$ $Date$ $Id$ --> |
|---|
| 802 | </BODY> |
|---|
| 803 | </HTML> |
|---|