Ticket #227: patch
File patch, 25.6 KB (added by , 17 years ago)
docs/replication_protocol.rst
.. Copyright (C) 2008 Lemur Consulting Ltd

====================================
Xapian Database Replication Protocol
====================================

.. contents:: Table of contents

This document contains details of the implementation of the replication
protocol, version 1.  For details of how and why to use the replication
protocol, see the separate `Replication Users Guide <replication.html>`_
document.

Protocol description
====================

FIXME - describe the protocol used to transfer the updates.


Changeset files
===============

Changes are represented by changeset files.  When changeset logging is enabled
for a flint database, just before each commit a changeset file is created in
the database directory.  This file contains a record of the changes made,
currently in the following format (but note that this format may change between
implementations of flint):

- 12 bytes holding the string "FlintChanges" (used to check that a file is a
  changeset file).

- The format version of the changeset (as a variable length unsigned integer).

- The revision number of the database before the changes were applied (as a
  variable length unsigned integer).

- The revision number of the database after the changes were applied (as a
  variable length unsigned integer).

- A byte:

  - 0 if the changes can safely be applied to a live database.

  - 1 if the changes cannot be applied while searching is in progress.  (This
    will be set if the database was modified in "DANGEROUS" mode.)

  - 2 if the changes contain a whole database copy (or at least, a copy of all
    active blocks in a database), in which case the changes should be used to
    make a brand new database.

- A series of items:

  - A byte: 0 if there are no more items in the changeset, 1 if the next item
    is a base file, 2 if the next item is a list of blocks.

  - A string holding a table name, preceded by the length of the string as a
    variable length unsigned integer.

  - If a base file:

    - The letter of the base file (currently 'A' or 'B').

    - The length of the file (as a variable length unsigned integer).

    - The contents of the base file.

  - If a list of blocks:

    - The blocksize in use.

    - A list of items:

      - A variable length unsigned integer holding 0 if the list is at an end,
        or holding (block number + 1) otherwise.

      - The contents of the block.

- A revision number that the database must be upgraded to, with more
  changesets, before it is safe to be made live.  This will normally be the
  revision number of the database after the changes were applied, but if the
  changeset was created by copying parts of the database without a read lock,
  and modifications were made to the database while the copy was in progress,
  parts of the copy may contain later revisions than intended - in this
  situation further changesets will be needed to ensure that these parts of the
  database are fully integrated.
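To make the layout above concrete, here is a minimal C++ sketch of reading and
validating a changeset header.  It is illustrative only: the `decode_vuint`
helper and its 7-bits-per-byte encoding are assumptions made for this sketch,
not the exact variable length integer routines flint uses.

    // Sketch: validating a changeset file header, following the layout above.
    #include <cstring>
    #include <stdexcept>
    #include <string>

    // Decode a variable length unsigned integer, assumed here to be 7 bits
    // per byte with the high bit set on all bytes except the last.
    static unsigned decode_vuint(const char *&p, const char *end) {
        unsigned value = 0;
        int shift = 0;
        while (p != end) {
            unsigned char b = static_cast<unsigned char>(*p++);
            value |= static_cast<unsigned>(b & 0x7f) << shift;
            if (!(b & 0x80)) return value;
            shift += 7;
        }
        throw std::runtime_error("truncated changeset header");
    }

    struct ChangesetHeader {
        unsigned version;
        unsigned old_revision;
        unsigned new_revision;
    };

    static ChangesetHeader read_changeset_header(const std::string &buf) {
        const char *p = buf.data();
        const char *end = p + buf.size();
        // The first 12 bytes must be the magic string "FlintChanges".
        if (buf.size() < 12 || std::memcmp(p, "FlintChanges", 12) != 0)
            throw std::runtime_error("not a changeset file");
        p += 12;
        ChangesetHeader h;
        h.version = decode_vuint(p, end);       // Format version.
        h.old_revision = decode_vuint(p, end);  // Revision before changes.
        h.new_revision = decode_vuint(p, end);  // Revision after changes.
        return h;
    }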
docs/replication.rst
.. Copyright (C) 2008 Lemur Consulting Ltd

=======================================
Xapian Database Replication Users Guide
=======================================

.. contents:: Table of contents

Introduction
============

It is often desirable to maintain multiple copies of a Xapian database, having
a "master" database which modifications are made on, and a set of secondary
(read-only, "slave") databases which these modifications propagate to.  For
example, to support a high query load there may be many search servers, each
with a local copy of the database, and a single indexing server.  In order to
allow scaling to a large number of search servers, with large databases and
frequent updates, we need a database replication implementation to have the
following characteristics:

- Data transfer is (at most) proportional to the size of the updates, rather
  than the size of the database, to allow frequent small updates to large
  databases to be replicated efficiently.

- Searching (on the slave databases) and indexing (on the master database) can
  continue during synchronisation.

- Data cached (in memory) on the slave databases is not discarded (unless it's
  actually out of date) as updates arrive, to ensure that searches continue to
  be performed quickly during and after updates.

- Synchronising each slave database involves low overhead (both IO and CPU) on
  the server holding the master database, so that many slaves can be updated
  from a single master without overloading it.

- Database synchronisation can be recovered after network outages or server
  failures without manual intervention and without excessive data transfer.

The database replication protocol is intended to support replicating a single
writable database to multiple (read-only) search servers, while satisfying all
of the above properties.  It is not intended to support replication of multiple
writable databases - there must always be a single master database to which all
modifications are made.

This document gives an overview of how and why to use the replication protocol.
For technical details of the implementation of the replication protocol, see
the separate `Replication Protocol <replication_protocol.html>`_ document.

Setting up replicated databases
===============================

FIXME - describe how to set up a set of replicated databases.


Alternative approaches
======================

Without using the database replication protocol, there are various ways in
which the "single master, multiple slaves" setup could be implemented:

- Copy the database from the master to all slaves after each update, then swap
  the new database for the old.

- Synchronise the databases from the master to the slaves using rsync.

- Keep a copy of the database on the master from before each update, use a
  binary diff algorithm (e.g., xdelta) to calculate the changes, and then
  apply these same changes to the databases on each slave.

- Serve the database from the master to the slaves over NFS (or another remote
  file system).

- Use the "remote database backend" facility of Xapian to allow slave servers
  to search the database directly on the master.

All of these could be made to work but have various drawbacks, and fail to
satisfy all the desired characteristics.
Let's examine them in detail:

Copying database after each update
----------------------------------

Databases could be pushed to the slaves after each update simply by copying the
entire database from the master (using scp, ftp, http or one of the many other
transfer options).  After the copy is completed, the new database would be made
live (perhaps by symlink switching, if symlinks are available).  After a
reasonable interval to allow searches in progress on the old database to
complete, the old database would be removed.  (On UNIX filesystems, the old
database could be unlinked immediately, and the resources used by it would be
automatically freed as soon as the current searches using it complete.)

This approach has the advantage of simplicity, and also ensures that the
databases can be correctly re-synchronised after network outages or hardware
failure.

However, this approach would involve copying a large amount of data for each
update, however small the update was.  Also, because the search server would
have to switch to accessing new files each time an update was pushed, the
search server is likely to experience poor performance due to commonly accessed
pages falling out of the disk cache during the update.  In particular, although
some of the newly pushed data would be likely to be in the cache immediately
after the update, if the combination of the old and new database sizes exceeds
the size of the memory available on the search servers for caching, either some
of the live database will be dropped from the cache, resulting in poor
performance during the update, or some of the new database will not initially
be present in the cache after the update.

Synchronise database using rsync
--------------------------------

Rsync works by calculating hashes for the content on the client and the server,
sending the hashes from the client to the server, and then calculating (on the
server) which pieces of the file need to be sent to update the client.  This
results in a fairly low amount of network traffic, but puts a fairly high CPU
load on the server.  This would result in a large load being placed on the
master server if a large number of slaves tried to synchronise with it.

Also, rsync will not reliably update the database in a manner which allows the
database on a slave to be searched while being updated - therefore, a copy or
snapshot of the database would need to be taken first to allow searches to
continue (accessing the copy) while the database is being synchronised.

If a copy is used, the caching problems discussed in the previous section would
apply again.  If a snapshotting filesystem is used, it may be possible to take
a read-only snapshot copy cheaply (and without encountering poor caching
behaviour), but filesystems with support for this are not always available, and
may require considerable effort to set up even if they are available.

Use a binary diff algorithm
---------------------------

If a copy of the database on the master from before the update was kept, a
binary diff algorithm (such as "xdelta") could be used to compare the old and
new versions of the database.  This would produce a patch file which could be
transferred to the slaves, and then applied - avoiding the need for specific
calculations to be performed for each slave.
However, this requires a copy or snapshot to be taken on the master - which has
the same problems as previously discussed.  A copy or snapshot would also need
to be taken on the slave, since a patch from xdelta couldn't safely be applied
to a live database.

Serve database from master to slaves over NFS
---------------------------------------------

NFS allows a section of a filesystem to be exported to a remote host.  Xapian
is quite capable of searching a database which is exported in such a manner,
and thus NFS can be used to quickly and easily share a database from the master
to multiple slaves.

A reasonable setup might be to use a powerful machine with a fast disk as the
master, and use that same machine as an NFS server.  Then, multiple slaves can
connect to that NFS server for searching the database.  This setup is quite
convenient, because it separates the indexing workload from the search workload
to a reasonable extent, but may lead to performance problems.

There are two main problems which are likely to be encountered.  Firstly, in
order to work efficiently, NFS clients (or the OS filesystem layer above NFS)
cache information read from the remote file system in memory.  If there is
insufficient memory available to cache the whole database in memory, searches
will occasionally need to access parts of the database which are held only on
the master server.  Such searches will take a long time to complete, because
the round-trip time for an access to a disk block on the master is typically a
lot slower than the round-trip time for access to a local disk.  Additionally,
if the local network experiences problems, or the master server fails (or gets
overloaded due to all the search requests), searches will be unable to
complete.

Also, when a file is modified, the NFS protocol has no way of indicating that
only a small set of blocks in the file has been modified.  The caching is all
implemented by the NFS clients, which can do little other than check the file
modification time periodically, and invalidate all cached blocks for the file
if the modification time has changed.  For the Linux client, the time between
checks can be configured by setting the acregmin and acregmax mount options,
but whatever these are set to, the whole file will be dropped from the cache
when any modification is found.

This means that, after every update to the database on the master, searches on
the slaves will have to fetch all the blocks required for their search across
the network, which will likely result in extremely slow search times until the
cache on the slaves gets populated properly again.

Use the "remote database backend" facility
------------------------------------------

Xapian has supported a "remote" database backend since the very early days of
the project.  This allows a search to be run against a database on a remote
machine, which may seem to be exactly what we want.  However, the "remote"
database backend works by performing most of the work for a search on the
remote end - in the situation we're concerned with, this would mean that most
of the work is performed on the master, while the slaves remain largely idle.

The "remote" database backend is intended to allow a large database to be
split, at the document level, between multiple hosts.
This allows systems to
be built which search a very large database with some degree of parallelism
(and thus provide faster individual searches than a system searching a single
database locally).  In contrast, the database replication protocol is intended
to allow a database to be copied to multiple machines to support a high
concurrent search load (and thus to allow a higher throughput of searches).

In some cases (i.e., a very large database and a high concurrent search load)
it may be perfectly reasonable to use the database replication protocol in
conjunction with the "remote" database backend to get both of these advantages
- the two systems solve different problems.
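For contrast, this is roughly what document-level splitting with the remote
backend looks like from the client side.  It is an illustrative sketch, not
part of this patch; the hostnames and port numbers are placeholders.

    // Sketch: a single search across a database split at the document level
    // over two remote hosts, via Xapian's remote backend.
    #include <iostream>
    #include <xapian.h>

    int main() {
        // Each remote database holds a disjoint subset of the documents.
        Xapian::Database db;
        db.add_database(Xapian::Remote::open("search1.example.com", 33333));
        db.add_database(Xapian::Remote::open("search2.example.com", 33333));

        // One search runs against the combined database; most of the
        // matching work happens on the remote ends, in parallel.
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("example"));
        Xapian::MSet matches = enquire.get_mset(0, 10);
        std::cout << "Estimated matches: " << matches.get_matches_estimated()
                  << std::endl;
        return 0;
    }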
docs/Makefile.am
@@ -11,8 +11,9 @@
 	bm25.html code_structure.html queryparser.html \
 	quickstartexpand.cc.html quickstartindex.cc.html quickstartsearch.cc.html
 
-RSTDOCS = admin_notes.rst deprecation.rst glossary.rst sorting.rst \
-	spelling.rst synonyms.rst termgenerator.rst valueranges.rst
+RSTDOCS = admin_notes.rst deprecation.rst glossary.rst \
+	replication.rst replication_protocol.rst \
+	sorting.rst spelling.rst synonyms.rst termgenerator.rst valueranges.rst
 
 RSTHTML = $(RSTDOCS:.rst=.html)
 
 # Files which should be put in the distribution by automake
bin/xapian-compact.cc
@@ -633,7 +633,7 @@
 	    }
 	}
 	tmpout.push_back(dest);
-	tmptab.commit(1);
+	tmptab.commit(1, -1);
     }
     swap(tmp, tmpout);
     swap(off, newoff);
@@ -941,7 +941,7 @@
     }
 
     // Commit as revision 1.
-    out.commit(1);
+    out.commit(1, -1);
 
     cout << '\r' << t->name << ": ";
     off_t out_size = 0;
backends/flint/flint_spelling.h
@@ -99,9 +99,9 @@
 	FlintTable::set_block_size(blocksize);
     }
 
-    void commit(flint_revision_number_t revision) {
+    void commit(flint_revision_number_t revision, int changes_fd) {
 	merge_changes();
-	FlintTable::commit(revision);
+	FlintTable::commit(revision, changes_fd);
     }
 
     void cancel() {
backends/flint/flint_table.cc
@@ -1689,7 +1689,7 @@
     base_.set_block_size(block_size_);
     base_.set_have_fakeroot(true);
     base_.set_sequential(true);
-    base_.write_to_file(name + "baseA");
+    base_.write_to_file(name + "baseA", -1);
 
     /* remove the alternative base file, if any */
     sys_unlink_if_exists(name + "baseB");
@@ -1727,7 +1727,7 @@
 }
 
 void
-FlintTable::commit(flint_revision_number_t revision)
+FlintTable::commit(flint_revision_number_t revision, int changes_fd)
 {
     DEBUGCALL(DB, void, "FlintTable::commit", revision);
     Assert(writable);
@@ -1789,7 +1789,7 @@
 	C[i].rewrite = false;
     }
 
-    base.write_to_file(name + "base" + char(base_letter));
+    base.write_to_file(name + "base" + char(base_letter), changes_fd);
     base.commit();
 
     read_root();
@@ -1799,7 +1799,23 @@
     seq_count = SEQ_START_POINT;
 }
 
+void
+FlintTable::write_changed_blocks(int changes_fd)
+{
+    Assert(changes_fd >= 0);
+
+    // Compare the old and new bitmaps to find blocks which have changed, and
+    // write them to the file descriptor.
+
+    uint4 n = 0;
+    while (base.find_changed_block(&n)) {
+	// Write block n to the file.
+
+	n += 1;
+    }
+}
+
 void
 FlintTable::cancel()
 {
     DEBUGCALL(DB, void, "FlintTable::cancel", "");
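The loop body in write_changed_blocks() above is still a stub.  Following the
block-list layout described in the changeset format (a variable length integer
holding block number + 1, then the raw block contents), it might eventually be
filled in along these lines.  read_block, pack_uint and flint_io_write are
taken from the surrounding flint code, but treat this as a sketch rather than
the intended implementation:

    // Possible body for the stub loop in FlintTable::write_changed_blocks(),
    // serialising each changed block in the changeset format: a variable
    // length integer holding (block number + 1), then the raw block.
    uint4 n = 0;
    byte * p = new byte[block_size];
    try {
	while (base.find_changed_block(&n)) {
	    // A zero varint terminates the block list, so store (n + 1).
	    string buf = pack_uint(n + 1);
	    flint_io_write(changes_fd, buf.data(), buf.size());

	    // Follow the block number with the raw contents of block n.
	    read_block(n, p);
	    flint_io_write(changes_fd, reinterpret_cast<const char *>(p),
			   block_size);

	    n += 1;
	}
    } catch (...) {
	delete [] p;
	throw;
    }
    delete [] p;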
backends/flint/flint_table.h
@@ -330,9 +330,18 @@
      *  be greater than the latest revision number (see
      *  get_latest_revision_number()), or an exception will be
      *  thrown.
+     *
+     *  @param changes_fd  The file descriptor to write changes to.  If -1,
+     *                     no changes will be written.
      */
-    void commit(flint_revision_number_t revision);
+    void commit(flint_revision_number_t revision, int changes_fd);
 
+    /** Append the list of blocks changed to a changeset file.
+     *
+     *  @param changes_fd  The file descriptor to write changes to.
+     */
+    void write_changed_blocks(int changes_fd);
+
     /** Cancel any outstanding changes.
      *
      *  This will discard any modifications which haven't been committed
backends/flint/flint_btreebase.h
@@ -86,7 +86,7 @@
     }
 
     /** Write the btree base file to disk. */
-    void write_to_file(const std::string &filename);
+    void write_to_file(const std::string &filename, int changes_fd);
 
     /* Methods dealing with the bitmap */
     /** true iff block n was free at the start of the transaction on
@@ -98,6 +98,12 @@
 
     uint4 next_free_block();
 
+    /** Find the first changed block at or after position *n.
+     *
+     *  Returns true if such a block was found, or false otherwise.
+     */
+    bool find_changed_block(uint4 * n);
+
     bool block_free_now(uint4 n);
 
     void calculate_last_block();
backends/flint/flint_database.cc
@@ -4,7 +4,7 @@
  * Copyright 2001 Hein Ragas
  * Copyright 2002 Ananova Ltd
  * Copyright 2002,2003,2004,2005,2006,2007 Olly Betts
- * Copyright 2006 Richard Boulton
+ * Copyright 2006,2008 Lemur Consulting Ltd
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as
@@ -27,7 +27,11 @@
 #include <xapian/error.h>
 
 #include "safeerrno.h"
+#ifdef __WIN32__
+# include "msvc_posix_wrapper.h"
+#endif
 
+#include "flint_io.h"
 #include "flint_database.h"
 #include "utils.h"
 #include "omdebug.h"
@@ -67,10 +67,34 @@
 // in the term).
 #define MAX_SAFE_TERM_LENGTH 245
 
+// Magic string used to recognise a changeset file.
+#define CHANGES_MAGIC_STRING "FlintChanges"
+
+// The current version of changeset files.
+// 1 - initial implementation
+#define CHANGES_VERSION 1u
+
 // Magic key in the postlist table (which corresponds to an invalid docid) is
 // used to store the next free docid and total length of all documents.
 static const string METAINFO_KEY("", 1);
 
+/** Delete file, throwing an error if we can't delete it (but not if it
+ *  doesn't exist).
+ */
+static void
+sys_unlink_if_exists(const string & filename)
+{
+#ifdef __WIN32__
+    if (msvc_posix_unlink(filename.c_str()) == -1) {
+#else
+    if (unlink(filename.c_str()) == -1) {
+#endif
+	if (errno == ENOENT) return;
+	throw Xapian::DatabaseError("Can't delete file: `" + filename +
+				    "': " + strerror(errno));
+    }
+}
+
 /* This finds the tables, opens them at consistent revisions, manages
  * determining the current and next revision numbers, and stores handles
  * to the tables.
@@ -337,13 +365,81 @@
 FlintDatabase::set_revision_number(flint_revision_number_t new_revision)
 {
     DEBUGCALL(DB, void, "FlintDatabase::set_revision_number", new_revision);
-    postlist_table.commit(new_revision);
-    position_table.commit(new_revision);
-    termlist_table.commit(new_revision);
-    value_table.commit(new_revision);
-    synonym_table.commit(new_revision);
-    spelling_table.commit(new_revision);
-    record_table.commit(new_revision);
+
+    int changes_fd = -1;
+    string changes_name;
+
+    if (1) { // FIXME - toggle writing of changesets somehow.
+	changes_name = db_dir + "/changes" + om_tostring(new_revision);
+#ifdef __WIN32__
+	changes_fd = msvc_posix_open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY);
+#else
+	changes_fd = open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0666);
+#endif
+	if (changes_fd < 0) {
+	    string message = string("Couldn't open changeset ")
+		    + changes_name + " to write: " + strerror(errno);
+	    throw Xapian::DatabaseOpeningError(message);
+	}
+    }
+
+    try {
+	fdcloser closefd(changes_fd);
+	if (changes_fd >= 0) {
+	    string buf;
+	    flint_revision_number_t old_revision =
+		    record_table.get_open_revision_number();
+	    buf += CHANGES_MAGIC_STRING;
+	    buf += pack_uint(CHANGES_VERSION);
+	    buf += pack_uint(old_revision);
+	    buf += pack_uint(new_revision);
+
+	    // FIXME - if DANGEROUS mode is in use, this should contain pack_uint(1u)
+	    buf += pack_uint(0u); // Changes can be applied to a live database.
+
+	    flint_io_write(changes_fd, buf.data(), buf.size());

+	    // Write the changes to the blocks in the tables.  Do the postlist
+	    // table last, so that it ends up cached the most, if the cache
+	    // available is limited.  Do the position and value tables just
+	    // before that, because they're also critical to search speed.
+	    termlist_table.write_changed_blocks(changes_fd);
+	    synonym_table.write_changed_blocks(changes_fd);
+	    spelling_table.write_changed_blocks(changes_fd);
+	    record_table.write_changed_blocks(changes_fd);
+	    position_table.write_changed_blocks(changes_fd);
+	    value_table.write_changed_blocks(changes_fd);
+	    postlist_table.write_changed_blocks(changes_fd);
+	}
+
+	postlist_table.commit(new_revision, changes_fd);
+	position_table.commit(new_revision, changes_fd);
+	termlist_table.commit(new_revision, changes_fd);
+	value_table.commit(new_revision, changes_fd);
+	synonym_table.commit(new_revision, changes_fd);
+	spelling_table.commit(new_revision, changes_fd);
+	record_table.commit(new_revision, changes_fd);
+
+	if (changes_fd >= 0) {
+	    string buf;
+	    buf += '\0';
+	    buf += pack_uint(new_revision);
+	    flint_io_write(changes_fd, buf.data(), buf.size());
+	    // FIXME - should really be calling flint_io_sync() on the changes
+	    // file before writing the new base for the record table, to ensure
+	    // that the changes file doesn't get lost (or, worse, left only
+	    // partially recorded on disk) if the system fails after the sync
+	    // of the record table, but before the sync of the changes file.
+	    flint_io_sync(changes_fd);
+	}
+    } catch (...) {
+	// Remove the changeset, if there was one.
+	if (changes_fd >= 0) {
+	    sys_unlink_if_exists(changes_name);
+	}
+
+	throw;
+    }
 }
 
 void
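The commit path above leans on fdcloser to guarantee the changeset descriptor
is closed on both the normal and the exception path.  The class itself is not
shown in this patch; a minimal reconstruction of such a RAII helper might look
like this (illustrative only, not the exact class from the tree):

    // Illustrative reconstruction of an RAII fd-closing helper like the
    // fdcloser used above: the descriptor is closed when the object goes
    // out of scope, whether by normal flow or by a thrown exception.
    #include <unistd.h>

    class fdcloser {
      public:
	// Takes ownership of fd; a negative value means "nothing to close".
	explicit fdcloser(int fd) : fd_(fd) {}
	~fdcloser() {
	    if (fd_ >= 0) (void)close(fd_);
	}
      private:
	// Non-copyable: exactly one owner should close the descriptor.
	fdcloser(const fdcloser &);
	void operator=(const fdcloser &);
	int fd_;
    };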
backends/flint/flint_synonym.h
@@ -102,9 +102,9 @@
 	FlintTable::set_block_size(blocksize);
     }
 
-    void commit(flint_revision_number_t revision) {
+    void commit(flint_revision_number_t revision, int changes_fd) {
 	merge_changes();
-	FlintTable::commit(revision);
+	FlintTable::commit(revision, changes_fd);
     }
 
     void cancel() {
backends/flint/flint_btreebase.cc
@@ -277,7 +277,7 @@
 }
 
 void
-FlintTable_base::write_to_file(const string &filename)
+FlintTable_base::write_to_file(const string &filename, int changes_fd)
 {
     calculate_last_block();
 
@@ -310,6 +310,15 @@
     }
     fdcloser closefd(h);
 
+    if (changes_fd >= 0) {
+	string changesbuf;
+	changesbuf += pack_uint(1u); // Indicates that this item is a base file.
+	changesbuf += filename[filename.size() - 1]; // The letter of the base file.
+	changesbuf += pack_uint(buf.size());
+	flint_io_write(changes_fd, changesbuf.data(), changesbuf.size());
+	flint_io_write(changes_fd, buf.data(), buf.size());
+    }
+
     flint_io_write(h, buf.data(), buf.size());
     flint_io_sync(h);
 }
@@ -412,7 +421,18 @@
     return n;
 }
 
+
 bool
+FlintTable_base::find_changed_block(uint4 * n)
+{
+    // Search for a block which was free at the start of the transaction, but
+    // isn't now.
+    (void) n;
+    // FIXME - implement
+    return false;
+}
+
+bool
 FlintTable_base::block_free_now(uint4 n)
 {
     uint4 i = n / CHAR_BIT;
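find_changed_block() is left as a FIXME above.  One plausible implementation,
assuming the base object keeps the bitmap from the start of the transaction
(bit_map0) alongside the current bitmap (bit_map) as the existing flint
base-file code does, would scan for the first block that is in use now but was
free at transaction start.  The member names here are assumptions, not code
from the patch:

    // One plausible implementation of the FIXME: scan the bitmaps for the
    // first block at or after *n which is in use now but was free at the
    // start of the transaction.
    bool
    FlintTable_base::find_changed_block(uint4 * n)
    {
	uint4 n_blocks = uint4(bit_map_size) * CHAR_BIT;
	for (uint4 block = *n; block < n_blocks; ++block) {
	    size_t i = block / CHAR_BIT;
	    int bit = 0x1 << (block % CHAR_BIT);
	    bool free_at_start = (bit_map0[i] & bit) == 0;
	    bool in_use_now = (bit_map[i] & bit) != 0;
	    if (free_at_start && in_use_now) {
		*n = block;  // Report the changed block back to the caller.
		return true;
	    }
	}
	return false;  // No more changed blocks.
    }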