Ticket #227: patch

File patch, 25.6 KB (added by Richard Boulton, 16 years ago)

Work in progress patch

  • docs/replication_protocol.rst

     
     1.. Copyright (C) 2008 Lemur Consulting Ltd
     2
     3====================================
     4Xapian Database Replication Protocol
     5====================================
     6
     7.. contents:: Table of contents
     8
     9This document contains details of the implementation of the replication
     10protocol, version 1.  For details of how and why to use the replication
     11protocol, see the separate `Replication Users Guide <replication.html>`_
     12document.
     13
     14Protocol description
     15====================
     16
     17FIXME - describe the protocol used to transfer the updates.
     18
     19
     20Changeset files
     21===============
     22
     23Changes are represented by changeset files.  When changeset logging is enabled
     24for a flint database, just before each commit a changeset file is created in
     25the database directory.  This file contains a record of the changes made,
     26currently in the following format (but note that this format may change between
     27implementations of flint):
     28
     29 - 12 bytes holding the string "FlintChanges" (used to check that a file is a
     30   changeset file).
     31
     32 - The format of the changeset (as a variable length unsigned integer).
     33
     34 - The revision number of the database before the changes were applied (as a
     35   variable length unsigned integer).
     36
     37 - The revision number of the database after the changes were applied (as a
     38   variable length unsigned integer).
     39
     40 - A byte:
     41
     42   - 0 if the changes can safely be applied to a live database
     43   
     44   - 1 if the changes cannot be applied while searching is in progress.  (This
     45     will be set if the database was modified in "DANGEROUS" mode).
     46
     47   - 2 if the changes contain a whole database copy (or at least, a copy of all
     48     active blocks in a database), in which case the changes should be used to
     49     make a brand new database.
     50
     51 - A series of items:
     52
     53   - A byte: 0 if there are no more items in the changeset, 1 if the next item
     54     is a base file, 2 if the next item is a list of blocks.
     55
     56   - A string holding a table name (preceded by the length of the string as
     57     a variable length unsigned integer).
     58
     59   - If a base file:
     60
     61     - The letter of the base file (currently 'A' or 'B').
     62
     63     - The length of the file (as a variable length unsigned integer).
     64
     65     - The contents of the base file.
     66
     67   - If a list of blocks:
     68
     69     - The blocksize in use.
     70
     71     - A list of items:
     72
     73       - A variable length unsigned integer holding 0 if the list is at an end,
     74         or holding (block number + 1) otherwise.
     75
     76       - The contents of the block.
     77
     78 - A revision number that the database must be upgraded to, with more
     79   changesets, before it is safe to be made live.  This will normally be the
     80   revision number of the database after the changes were applied, but if the
     81   changeset was created by copying parts of the database without a read lock,
     82   and modifications were made to the database while the copy was in progress,
     83   parts of the copy may contain later revisions than intended - in this
     84   situation further changesets will be needed to ensure that these parts of the
     85   database are fully integrated.
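The header framing described above can be illustrated in code. The sketch below is not the actual flint implementation: it assumes a 7-bit, least-significant-first varint encoding for the "variable length unsigned integer" fields (the real on-disk encoding is defined by flint's pack_uint() and may differ), and the function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode a variable-length unsigned integer: 7 bits per byte, least
// significant first, high bit set on every byte except the last.
// (Assumed encoding - see the lead-in above.)
bool unpack_uint(const char **p, const char *end, uint64_t *result) {
    uint64_t value = 0;
    int shift = 0;
    while (*p != end) {
        unsigned char byte = static_cast<unsigned char>(*(*p)++);
        value |= uint64_t(byte & 0x7f) << shift;
        if (!(byte & 0x80)) { *result = value; return true; }
        shift += 7;
    }
    return false; // Ran out of data mid-integer.
}

// Check the 12-byte magic string and read the changeset format version
// and the before/after revision numbers from the start of a changeset
// file held in `data`.
bool parse_changeset_header(const std::string &data,
                            uint64_t *format,
                            uint64_t *rev_before, uint64_t *rev_after) {
    const std::string magic = "FlintChanges";
    if (data.compare(0, magic.size(), magic) != 0) return false;
    const char *p = data.data() + magic.size();
    const char *end = data.data() + data.size();
    return unpack_uint(&p, end, format) &&
           unpack_uint(&p, end, rev_before) &&
           unpack_uint(&p, end, rev_after);
}
```

This only covers the fixed header; the "danger level" byte and the item list that follow would be read the same way.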
  • docs/replication.rst

     
     1.. Copyright (C) 2008 Lemur Consulting Ltd
     2
     3=======================================
     4Xapian Database Replication Users Guide
     5=======================================
     6
     7.. contents:: Table of contents
     8
     9Introduction
     10============
     11
     12It is often desirable to maintain multiple copies of a Xapian database, having
     13a "master" database which modifications are made on, and a set of secondary
     14(read-only, "slave") databases which these modifications propagate to.  For
     15example, to support a high query load there may be many search servers, each
     16with a local copy of the database, and a single indexing server.  In order to
     17allow scaling to a large number of search servers, with large databases and
     18frequent updates, we need a database replication implementation to have the
     19following characteristics:
     20
     21 - Data transfer is (at most) proportional to the size of the updates, rather
     22   than the size of the database, to allow frequent small updates to large
     23   databases to be replicated efficiently.
     24
     25 - Searching (on the slave databases) and indexing (on the master database) can
     26   continue during synchronisation.
     27
     28 - Data cached (in memory) on the slave databases is not discarded (unless it's
     29   actually out of date) as updates arrive, to ensure that searches continue to
     30   be performed quickly during and after updates.
     31
     32 - Synchronising each slave database involves low overhead (both IO and CPU) on
     33   the server holding the master database, so that many slaves can be updated
     34   from a single master without overloading it.
     35
     36 - Database synchronisation can be recovered after network outages or server
     37   failures without manual intervention and without excessive data transfer.
     38
     39The database replication protocol is intended to support replicating a single
     40writable database to multiple (read-only) search servers, while satisfying all
     41of the above properties.  It is not intended to support replication of multiple
     42writable databases - there must always be a single master database to which all
     43modifications are made.
     44
     45This document gives an overview of how and why to use the replication protocol.
     46For technical details of the implementation of the replication protocol, see
     47the separate `Replication Protocol <replication_protocol.html>`_ document.
     48
     49Setting up replicated databases
     50===============================
     51
     52FIXME - describe how to set up a set of replicated databases.
     53
     54
     55Alternative approaches
     56======================
     57
     58Without using the database replication protocol, there are various ways in
     59which the "single master, multiple slaves" setup could be implemented.
     60
     61 - Copy database from master to all slaves after each update, then swap the new
     62   database for the old.
     63
     64 - Synchronise databases from the master to the slaves using rsync.
     65
     66 - Keep copy of database on master from before each update, and use a binary
     67   diff algorithm (e.g., xdelta) to calculate the changes, and then apply these
     68   same changes to the databases on each slave.
     69
     70 - Serve database from master to slaves over NFS (or other remote file system).
     71
     72 - Use the "remote database backend" facility of Xapian to allow slave servers
     73   to search the database directly on the master.
     74
     75All of these could be made to work but have various drawbacks, and fail to
     76satisfy all the desired characteristics.  Let's examine them in detail:
     77
     78Copying database after each update
     79----------------------------------
     80
     81Databases could be pushed to the slaves after each update simply by copying the
     82entire database from the master (using scp, ftp, http or one of the many other
     83transfer options).  After the copy is completed, the new database would be made
     84live (perhaps by symlink switching, if symlinks are available).  After a
     85reasonable interval to allow searches in progress on the old database to
     86complete, the old database would be removed.  (On UNIX filesystems, the old
     87database could be unlinked immediately, and the resources used by it would be
     88automatically freed as soon as the current searches using it complete.)
     89
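The symlink switching mentioned above can be made atomic on POSIX systems by creating the new link under a temporary name and then rename()ing it over the live one, since rename() atomically replaces an existing link. A minimal sketch (the function and path names are illustrative, not part of Xapian):

```cpp
#include <cassert>
#include <cstdio>     // rename()
#include <string>
#include <unistd.h>   // symlink(), unlink(), readlink()

// Point the "live" symlink at a freshly copied database directory.
// rename() replaces the old link atomically, so searchers always see
// either the old database or the new one, never a missing link.
bool switch_live_symlink(const std::string &new_db_dir,
                         const std::string &live_link) {
    std::string tmp_link = live_link + ".new";
    (void)unlink(tmp_link.c_str());  // Remove any stale temporary link.
    if (symlink(new_db_dir.c_str(), tmp_link.c_str()) == -1)
        return false;
    if (rename(tmp_link.c_str(), live_link.c_str()) == -1) {
        unlink(tmp_link.c_str());
        return false;
    }
    return true;
}
```

The old database directory can then be removed once searches using it have finished, as described above.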
     90This approach has the advantage of simplicity, and also ensures that the
     91databases can be correctly re-synchronised after network outages or hardware
     92failure.
     93
     94However, this approach would involve copying a large amount of data for each
     95update, however small the update was.  Also, because the search server would
     96have to switch to access new files each time an update was pushed, the search
     97server will be likely to experience poor performance due to commonly accessed
     98pages falling out of the disk cache during the update.  In particular, although
     99some of the newly pushed data would be likely to be in the cache immediately
     100after the update, if the combination of the old and new database sizes exceeds
     101the size of the memory available on the search servers for caching, either some
     102of the live database will be dropped from the cache resulting in poor
     103performance during the update, or some of the new database will not initially
     104be present in the cache after update.
     105
     106Synchronise database using rsync
     107--------------------------------
     108
     109Rsync works by calculating hashes for the content on the client and the server,
     110sending the hashes from the client to the server, and then calculating (on the
     111server) which pieces of the file need to be sent to update the client.  This
     112results in a fairly low amount of network traffic, but puts a fairly high CPU
     113load on the server.  This would result in a large load being placed on the
     114master server if a large number of slaves tried to synchronise with it.
     115
     116Also, rsync will not reliably update the database in a manner which allows the
     117database on a slave to be searched while being updated - therefore, a copy or
     118snapshot of the database would need to be taken first to allow searches to
     119continue (accessing the copy) while the database is being synchronised.
     120
     121If a copy is used, the caching problems discussed in the previous section would
     122apply again.  If a snapshotting filesystem is used, it may be possible to take
     123a read-only snapshot copy cheaply (and without encountering poor caching
     124behaviour), but filesystems with support for this are not always available, and
     125may require considerable effort to set up even if they are available.
     126
     127Use a binary diff algorithm
     128---------------------------
     129
     130If a copy of the database on the master before the update was kept, a binary
     131diff algorithm (such as "xdelta") could be used to compare the old and new
     132versions of the database.  This would produce a patch file which could be
     133transferred to the slaves, and then applied - avoiding the need for specific
     134calculations to be performed for each slave.
     135
     136However, this requires a copy or snapshot to be taken on the master - which has
     137the same problems as previously discussed.  A copy or snapshot would also need
     138to be taken on the slave, since a patch from xdelta couldn't safely be applied
     139to a live database.
     140
     141Serve database from master to slaves over NFS
     142---------------------------------------------
     143
     144NFS allows a section of a filesystem to be exported to a remote host.  Xapian
     145is quite capable of searching a database which is exported in such a manner,
     146and thus NFS can be used to quickly and easily share a database from the master
     147to multiple slaves.
     148
     149A reasonable setup might be to use a powerful machine with a fast disk as the
     150master, and use that same machine as an NFS server.  Then, multiple slaves can
     151connect to that NFS server for searching the database. This setup is quite
     152convenient, because it separates the indexing workload from the search workload
     153to a reasonable extent, but may lead to performance problems.
     154
     155There are two main problems which are likely to be encountered.  Firstly, in
     156order to work efficiently, NFS clients (or the OS filesystem layer above NFS)
     157cache information read from the remote file system in memory.  If there is
     158insufficient memory available to cache the whole database in memory, searches
     159will occasionally need to access parts of the database which are held only on
     160the master server.  Such searches will take a long time to complete, because
     161the round-trip time for an access to a disk block on the master is typically a
     162lot slower than the round-trip time for access to a local disk.  Additionally,
     163if the local network experiences problems, or the master server fails (or gets
     164overloaded due to all the search requests), the searches will not be able to
     165complete.
     166
     167Also, when a file is modified, the NFS protocol has no way of indicating that
     168only a small set of blocks in the file have been modified.  The caching is all
     169implemented by NFS clients, which can do little other than check the file
     170modification time periodically, and invalidate all cached blocks for the file
     171if the modification time has changed. For the Linux client, the time between
     172checks can be configured by setting the acregmin and acregmax mount options,
     173but whatever these are set to, the whole file will be dropped from the cache
     174when any modification is found.
     175
     176This means that, after every update to the database on the master, searches on
     177the slaves will have to fetch all the blocks required for their search across
     178the network, which will likely result in extremely slow search times until the
     179cache on the slaves gets populated properly again.
     180
     181Use the "remote database backend" facility
     182------------------------------------------
     183
     184Xapian has supported a "remote" database backend since the very early days of
     185the project.  This allows a search to be run against a database on a remote
     186machine, which may seem to be exactly what we want.  However, the "remote"
     187database backend works by performing most of the work for a search on the
     188remote end - in the situation we're concerned with, this would mean that most
     189of the work was performed on the master, while slaves remain largely idle.
     190
     191The "remote" database backend is intended to allow a large database to be
     192split, at the document level, between multiple hosts.  This allows systems to
     193be built which search a very large database with some degree of parallelism
     194(and thus provide faster individual searches than a system searching a single
     195database locally).  In contrast, the database replication protocol is intended
     196to allow a database to be copied to multiple machines to support a high
     197concurrent search load (and thus to allow a higher throughput of searches).
     198
     199In some cases (i.e., a very large database and a high concurrent search load)
     200it may be perfectly reasonable to use the database replication protocol in
     201conjunction with the "remote" database backend to get both of these advantages
     202- the two systems solve different problems.
     203
  • docs/Makefile.am

     
    1111 bm25.html code_structure.html queryparser.html \
    1212 quickstartexpand.cc.html quickstartindex.cc.html quickstartsearch.cc.html
    1313
    14 RSTDOCS = admin_notes.rst deprecation.rst glossary.rst sorting.rst \
    15  spelling.rst synonyms.rst termgenerator.rst valueranges.rst
     14RSTDOCS = admin_notes.rst deprecation.rst glossary.rst \
     15 replication.rst replication_protocol.rst \
     16 sorting.rst spelling.rst synonyms.rst termgenerator.rst valueranges.rst
    1617RSTHTML = $(RSTDOCS:.rst=.html)
    1718
    1819# Files which should be put in the distribution by automake
  • bin/xapian-compact.cc

     
    633633                }
    634634            }
    635635            tmpout.push_back(dest);
    636             tmptab.commit(1);
     636            tmptab.commit(1, -1);
    637637        }
    638638        swap(tmp, tmpout);
    639639        swap(off, newoff);
     
    941941            }
    942942
    943943            // Commit as revision 1.
    944             out.commit(1);
     944            out.commit(1, -1);
    945945
    946946            cout << '\r' << t->name << ": ";
    947947            off_t out_size = 0;
  • backends/flint/flint_spelling.h

     
    9999        FlintTable::set_block_size(blocksize);
    100100    }
    101101
    102     void commit(flint_revision_number_t revision) {
     102    void commit(flint_revision_number_t revision, int changes_fd) {
    103103        merge_changes();
    104         FlintTable::commit(revision);
     104        FlintTable::commit(revision, changes_fd);
    105105    }
    106106
    107107    void cancel() {
  • backends/flint/flint_table.cc

     
    16891689    base_.set_block_size(block_size_);
    16901690    base_.set_have_fakeroot(true);
    16911691    base_.set_sequential(true);
    1692     base_.write_to_file(name + "baseA");
     1692    base_.write_to_file(name + "baseA", -1);
    16931693
    16941694    /* remove the alternative base file, if any */
    16951695    sys_unlink_if_exists(name + "baseB");
     
    17271727}
    17281728
    17291729void
    1730 FlintTable::commit(flint_revision_number_t revision)
     1730FlintTable::commit(flint_revision_number_t revision, int changes_fd)
    17311731{
    17321732    DEBUGCALL(DB, void, "FlintTable::commit", revision);
    17331733    Assert(writable);
     
    17891789        C[i].rewrite = false;
    17901790    }
    17911791
    1792     base.write_to_file(name + "base" + char(base_letter));
     1792    base.write_to_file(name + "base" + char(base_letter), changes_fd);
    17931793    base.commit();
    17941794
    17951795    read_root();
     
    17991799    seq_count = SEQ_START_POINT;
    18001800}
    18011801
     1802
    18021803void
     1804FlintTable::write_changed_blocks(int changes_fd)
     1805{
     1806    Assert(changes_fd >= 0);
     1807
     1808    // Compare the old and new bitmaps to find blocks which have changed, and
     1809    // write them to the file descriptor.
     1810
     1811    uint4 n = 0;
     1812    while (base.find_changed_block(&n)) {
     1813        // Write block n to the file.
     1814       
     1815        n += 1;
     1816    }
     1817}
     1818
     1819void
    18031820FlintTable::cancel()
    18041821{
    18051822    DEBUGCALL(DB, void, "FlintTable::cancel", "");
  • backends/flint/flint_table.h

     
    330330         *          be greater than the latest revision number (see
    331331         *          get_latest_revision_number()), or an exception will be
    332332         *          thrown.
     333         *
     334         *  @param changes_fd  The file descriptor to write changes to.  If -1,
     335         *          no changes will be written.
    333336         */
    334         void commit(flint_revision_number_t revision);
     337        void commit(flint_revision_number_t revision, int changes_fd);
    335338
     339        /** Append the list of blocks changed to a changeset file.
     340         *
     341         *  @param changes_fd  The file descriptor to write changes to.
     342         */
     343        void write_changed_blocks(int changes_fd);
     344
    336345        /** Cancel any outstanding changes.
    337346         *
    338347         *  This will discard any modifications which haven't been committed
  • backends/flint/flint_btreebase.h

     
    8686        }
    8787
    8888        /** Write the btree base file to disk. */
    89         void write_to_file(const std::string &filename);
     89        void write_to_file(const std::string &filename, int changes_fd);
    9090
    9191        /* Methods dealing with the bitmap */
    9292        /** true iff block n was free at the start of the transaction on
     
    9898
    9999        uint4 next_free_block();
    100100
     101        /** Find the first changed block at or after position *n.
     102         *
     103         *  Returns true if such a block was found, or false otherwise.
     104         */
     105        bool find_changed_block(uint4 * n);
     106
    101107        bool block_free_now(uint4 n);
    102108
    103109        void calculate_last_block();
  • backends/flint/flint_database.cc

     
    44 * Copyright 2001 Hein Ragas
    55 * Copyright 2002 Ananova Ltd
    66 * Copyright 2002,2003,2004,2005,2006,2007 Olly Betts
    7  * Copyright 2006 Richard Boulton
     7 * Copyright 2006,2008 Lemur Consulting Ltd
    88 *
    99 * This program is free software; you can redistribute it and/or
    1010 * modify it under the terms of the GNU General Public License as
     
    2727#include <xapian/error.h>
    2828
    2929#include "safeerrno.h"
     30#ifdef __WIN32__
     31# include "msvc_posix_wrapper.h"
     32#endif
    3033
     34#include "flint_io.h"
    3135#include "flint_database.h"
    3236#include "utils.h"
    3337#include "omdebug.h"
     
    6771// in the term).
    6872#define MAX_SAFE_TERM_LENGTH 245
    6973
     74// Magic string used to recognise a changeset file.
     75#define CHANGES_MAGIC_STRING "FlintChanges"
     76
     77// The current version of changeset files.
     78// 1  - initial implementation
     79#define CHANGES_VERSION 1u
     80
    7081// Magic key in the postlist table (which corresponds to an invalid docid) is
    7182// used to store the next free docid and total length of all documents.
    7283static const string METAINFO_KEY("", 1);
    7384
     85/** Delete file, throwing an error if we can't delete it (but not if it
     86 *  doesn't exist).
     87 */
     88static void
     89sys_unlink_if_exists(const string & filename)
     90{
     91#ifdef __WIN32__
     92    if (msvc_posix_unlink(filename.c_str()) == -1) {
     93#else
     94    if (unlink(filename.c_str()) == -1) {
     95#endif
     96        if (errno == ENOENT) return;
     97        throw Xapian::DatabaseError("Can't delete file: `" + filename +
     98                              "': " + strerror(errno));
     99    }
     100}
     101
    74102/* This finds the tables, opens them at consistent revisions, manages
    75103 * determining the current and next revision numbers, and stores handles
    76104 * to the tables.
     
    337365FlintDatabase::set_revision_number(flint_revision_number_t new_revision)
    338366{
    339367    DEBUGCALL(DB, void, "FlintDatabase::set_revision_number", new_revision);
    340     postlist_table.commit(new_revision);
    341     position_table.commit(new_revision);
    342     termlist_table.commit(new_revision);
    343     value_table.commit(new_revision);
    344     synonym_table.commit(new_revision);
    345     spelling_table.commit(new_revision);
    346     record_table.commit(new_revision);
     368
     369    int changes_fd = -1;
     370    string changes_name;
     371
     372    if (1) { // FIXME - toggle writing of changesets somehow.
     373        changes_name = db_dir + "/changes" + om_tostring(new_revision);
     374#ifdef __WIN32__
     375        changes_fd = msvc_posix_open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY);
     376#else
     377        changes_fd = open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0666);
     378#endif
     379        if (changes_fd < 0) {
     380            string message = string("Couldn't open changeset ")
     381                    + changes_name + " to write: " + strerror(errno);
     382            throw Xapian::DatabaseOpeningError(message);
     383        }
     384    }
     385
     386    try {
     387        fdcloser closefd(changes_fd);
     388        if (changes_fd >= 0) {
     389            string buf;
     390            flint_revision_number_t old_revision =
     391                    record_table.get_open_revision_number();
     392            buf += CHANGES_MAGIC_STRING;
     393            buf += pack_uint(CHANGES_VERSION);
     394            buf += pack_uint(old_revision);
     395            buf += pack_uint(new_revision);
     396
     397            // FIXME - if DANGEROUS mode is in use, this should contain pack_uint(1u)
     398            buf += pack_uint(0u); // Changes can be applied to a live database.
     399
     400            flint_io_write(changes_fd, buf.data(), buf.size());
     401
     402            // Write the changes to the blocks in the tables.  Do the postlist
     403            // table last, so that ends up cached the most, if the cache
     404            // available is limited.  Do the position and value tables just
     405            // before that, because they're also critical to search speed.
     406            termlist_table.write_changed_blocks(changes_fd);
     407            synonym_table.write_changed_blocks(changes_fd);
     408            spelling_table.write_changed_blocks(changes_fd);
     409            record_table.write_changed_blocks(changes_fd);
     410            position_table.write_changed_blocks(changes_fd);
     411            value_table.write_changed_blocks(changes_fd);
     412            postlist_table.write_changed_blocks(changes_fd);
     413        }
     414
     415        postlist_table.commit(new_revision, changes_fd);
     416        position_table.commit(new_revision, changes_fd);
     417        termlist_table.commit(new_revision, changes_fd);
     418        value_table.commit(new_revision, changes_fd);
     419        synonym_table.commit(new_revision, changes_fd);
     420        spelling_table.commit(new_revision, changes_fd);
     421        record_table.commit(new_revision, changes_fd);
     422
     423        if (changes_fd >= 0) {
     424            string buf;
     425            buf += '\0';
     426            buf += pack_uint(new_revision);
     427            flint_io_write(changes_fd, buf.data(), buf.size());
     428            // FIXME - should really be calling flint_io_sync() on the changes
     429            // file before writing the new base for the record table, to ensure
     430            // that the changes file doesn't get lost (or, worse, left only
     431            // partially recorded on disk) if the system fails after the sync
     432            // of the record table, but before the sync of the changes file.
     433            flint_io_sync(changes_fd);
     434        }
     435    } catch(...) {
     436        // Remove the changeset, if there was one.
     437        if (changes_fd >= 0) {
     438            sys_unlink_if_exists(changes_name);
     439        }
     440
     441        throw;
     442    }
    347443}
    348444
    349445void
  • backends/flint/flint_synonym.h

     
    102102        FlintTable::set_block_size(blocksize);
    103103    }
    104104
    105     void commit(flint_revision_number_t revision) {
     105    void commit(flint_revision_number_t revision, int changes_fd) {
    106106        merge_changes();
    107         FlintTable::commit(revision);
     107        FlintTable::commit(revision, changes_fd);
    108108    }
    109109
    110110    void cancel() {
  • backends/flint/flint_btreebase.cc

     
    277277}
    278278
    279279void
    280 FlintTable_base::write_to_file(const string &filename)
     280FlintTable_base::write_to_file(const string &filename, int changes_fd)
    281281{
    282282    calculate_last_block();
    283283
     
    310310    }
    311311    fdcloser closefd(h);
    312312
     313    if (changes_fd >= 0) {
     314        string changesbuf;
     315        changesbuf += pack_uint(1u); // Indicates that this item is a base file.
     316        changesbuf += filename[filename.size() - 1]; // The letter of the base file.
     317        changesbuf += pack_uint(buf.size());
     318        flint_io_write(changes_fd, changesbuf.data(), changesbuf.size());
     319        flint_io_write(changes_fd, buf.data(), buf.size());
     320    }
     321
    313322    flint_io_write(h, buf.data(), buf.size());
    314323    flint_io_sync(h);
    315324}
     
    412421    return n;
    413422}
    414423
     424
    415425bool
     426FlintTable_base::find_changed_block(uint4 * n)
     427{
     428    // Search for a block which was free at the start of the transaction, but
     429    // isn't now.
     430    (void) n;
     431    // FIXME - implement
     432    return false;
     433}
     434
     435bool
    416436FlintTable_base::block_free_now(uint4 n)
    417437{
    418438    uint4 i = n / CHAR_BIT;