Index: docs/replication_protocol.rst
===================================================================
--- docs/replication_protocol.rst	(revision 0)
+++ docs/replication_protocol.rst	(revision 0)
@@ -0,0 +1,135 @@
+.. Copyright (C) 2008 Lemur Consulting Ltd
+
+====================================
+Xapian Database Replication Protocol
+====================================
+
+.. contents:: Table of contents
+
+This document contains details of the implementation of the replication
+protocol, version 1.  For details of how and why to use the replication
+protocol, see the separate `Replication Users Guide <replication.html>`_
+document.
+
+Protocol description
+====================
+
+FIXME - describe the protocol used to transfer the updates.
+
+
+Changeset files
+===============
+
+Changes are represented by changeset files.  When changeset logging is enabled
+for a flint database, a changeset file is created in the database directory
+just before each commit.  This file contains a record of the changes made, and
+a sketch of code for reading its header is given after the list below.  The
+format is currently as follows (but note that it may change between
+implementations of flint):
+
+ - 12 bytes holding the string "FlintChanges" (used to check that a file is a
+   changeset file).
+
+ - The format of the changeset (as a variable length unsigned integer).
+
+ - The revision number of the database before the changes were applied (as a
+   variable length unsigned integer).
+
+ - The revision number of the database after the changes were applied (as a
+   variable length unsigned integer).
+
+ - A byte:
+
+   - 0 if the changes can safely be applied to a live database
+   
+   - 1 if the changes cannot be applied while searching is in progress.  (This
+     will be set if the database was modified in "DANGEROUS" mode).
+
+   - 2 if the changes contain a whole database copy (or at least, a copy of all
+     active blocks in a database), in which case the changes should be used to
+     make a brand new database.
+
+ - A series of items:
+
+   - A byte: 0 if there are no more items in the changeset, 1 if the next item
+     is a base file, 2 if the next item is a list of blocks.
+
+   - A string holding a table name (preceded by the length of the string as a
+     variable length unsigned integer).
+
+   - If a base file:
+
+     - The letter of the base file (currently 'A' or 'B').
+
+     - The length of the file (as a variable length unsigned integer).
+
+     - The contents of the base file.
+
+   - If a list of blocks:
+
+     - The blocksize in use.
+
+     - A list of items:
+
+       - A variable length unsigned integer holding 0 if the list is at an end,
+         or holding (block number + 1) otherwise.
+
+       - The contents of the block.
+
+ - A revision number that the database must be upgraded to, with more
+   changesets, before it is safe to be made live.  This will normally be the
+   revision number of the database after the changes were applied, but if the
+   changeset was created by copying parts of the database without a read lock,
+   and modifications were made to the database while the copy was in progress,
+   parts of the copy may contain later revisions than intended - in this
+   situation further changesets will be needed to ensure that these parts of the
+   database are fully integrated.
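+
+To make this layout concrete, here is a minimal sketch (not part of flint
+itself) of reading and validating the fixed header of a changeset file.  The
+``decode_uint`` helper below is hypothetical: it assumes that the variable
+length unsigned integers are encoded 7 bits per byte, low bits first, with a
+continuation bit; check ``flint_utils.h`` for the authoritative encoding
+before relying on it::
+
+    #include <stdexcept>
+    #include <string>
+
+    // Hypothetical decoder - assumes 7 bits per byte, low bits first, with
+    // the top bit set on every byte except the last.
+    static unsigned decode_uint(const char ** p, const char * end)
+    {
+        unsigned value = 0;
+        int shift = 0;
+        while (*p != end) {
+            unsigned char byte = static_cast<unsigned char>(*(*p)++);
+            value |= static_cast<unsigned>(byte & 0x7f) << shift;
+            if (!(byte & 0x80)) return value;
+            shift += 7;
+        }
+        throw std::runtime_error("Truncated integer in changeset file");
+    }
+
+    struct ChangesetHeader {
+        unsigned format;      // changeset format version
+        unsigned rev_before;  // revision before the changes were applied
+        unsigned rev_after;   // revision after the changes were applied
+        unsigned char flags;  // 0, 1 or 2, as described above
+    };
+
+    static ChangesetHeader read_header(const std::string & data)
+    {
+        static const std::string magic("FlintChanges");
+        if (data.compare(0, magic.size(), magic) != 0)
+            throw std::runtime_error("Not a changeset file");
+        const char * p = data.data() + magic.size();
+        const char * end = data.data() + data.size();
+        ChangesetHeader header;
+        header.format = decode_uint(&p, end);
+        header.rev_before = decode_uint(&p, end);
+        header.rev_after = decode_uint(&p, end);
+        if (p == end)
+            throw std::runtime_error("Truncated changeset file");
+        header.flags = static_cast<unsigned char>(*p++);
+        return header;
+    }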
Index: docs/replication.rst
===================================================================
--- docs/replication.rst	(revision 0)
+++ docs/replication.rst	(revision 0)
@@ -0,0 +1,242 @@
+.. Copyright (C) 2008 Lemur Consulting Ltd
+
+=======================================
+Xapian Database Replication Users Guide
+=======================================
+
+.. contents:: Table of contents
+
+Introduction
+============
+
+It is often desirable to maintain multiple copies of a Xapian database, with a
+"master" database on which modifications are made, and a set of secondary
+(read-only, "slave") databases to which these modifications propagate.  For
+example, to support a high query load there may be many search servers, each
+with a local copy of the database, and a single indexing server.  In order to
+scale to a large number of search servers, with large databases and frequent
+updates, we need a database replication implementation with the following
+characteristics:
+
+ - Data transfer is (at most) proportional to the size of the updates, rather
+   than the size of the database, to allow frequent small updates to large
+   databases to be replicated efficiently.
+
+ - Searching (on the slave databases) and indexing (on the master database) can
+   continue during synchronisation.
+
+ - Data cached (in memory) on the slave databases is not discarded (unless it's
+   actually out of date) as updates arrive, to ensure that searches continue to
+   be performed quickly during and after updates.
+
+ - Synchronising each slave database involves low overhead (both IO and CPU) on
+   the server holding the master database, so that many slaves can be updated
+   from a single master without overloading it.
+
+ - Database synchronisation can be recovered after network outages or server
+   failures without manual intervention and without excessive data transfer.
+
+The database replication protocol is intended to support replicating a single
+writable database to multiple (read-only) search servers, while satisfying all
+of the above properties.  It is not intended to support replication of multiple
+writable databases - there must always be a single master database to which all
+modifications are made.
+
+This document gives an overview of how and why to use the replication protocol.
+For technical details of the implementation of the replication protocol, see
+the separate `Replication Protocol <replication_protocol.html>`_ document.
+
+Setting up replicated databases
+===============================
+
+FIXME - describe how to set up a set of replicated databases.
+
+
+Alternative approaches
+======================
+
+Without using the database replication protocol, there are various ways in
+which the "single master, multiple slaves" setup could be implemented.
+
+ - Copy database from master to all slaves after each update, then swap the new
+   database for the old.
+
+ - Synchronise databases from the master to the slaves using rsync.
+
+ - Keep a copy of the database on the master from before each update, use a
+   binary diff algorithm (e.g., xdelta) to calculate the changes, and then
+   apply these changes to the databases on each slave.
+
+ - Serve database from master to slaves over NFS (or other remote file system).
+
+ - Use the "remote database backend" facility of Xapian to allow slave servers
+   to search the database directly on the master.
+
+All of these could be made to work but have various drawbacks, and fail to
+satisfy all the desired characteristics.  Let's examine them in detail:
+
+Copying database after each update
+----------------------------------
+
+Databases could be pushed to the slaves after each update simply by copying the
+entire database from the master (using scp, ftp, http or one of the many other
+transfer options).  After the copy is completed, the new database would be made
+live (perhaps by symlink switching, if symlinks are available).  After a
+reasonable interval to allow searches in progress on the old database to
+complete, the old database would be removed.  (On UNIX filesystems, the old
+database could be unlinked immediately, and the resources used by it would be
+automatically freed as soon as the current searches using it complete.)
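+
+For example, on POSIX systems the switch can be done atomically by creating a
+new symbolic link and renaming it over the old one (a sketch, with error
+handling omitted; the filenames are illustrative)::
+
+    #include <unistd.h>
+    #include <cstdio>
+
+    int main() {
+        // "db_20080101" is the freshly copied database directory; "db" is
+        // the name the search server opens.  rename() replaces the old
+        // symlink atomically, so searches see the old or the new database.
+        symlink("db_20080101", "db.new");
+        std::rename("db.new", "db");
+        return 0;
+    }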
+
+This approach has the advantage of simplicity, and also ensures that the
+databases can be correctly re-synchronised after network outages or hardware
+failure.
+
+However, this approach would involve copying a large amount of data for each
+update, no matter how small the update was.  Also, because the search server
+would have to switch to a new set of files each time an update was pushed, it
+would be likely to suffer poor performance due to commonly accessed pages
+falling out of the disk cache during the update.  In particular, if the
+combined size of the old and new databases exceeded the memory available on
+the search servers for caching, then either some of the live database would be
+dropped from the cache (causing poor performance during the update), or some
+of the new database would not be in the cache when it went live (causing poor
+performance just after the update).
+
+Synchronise database using rsync
+--------------------------------
+
+Rsync works by calculating hashes for the content on the client and the server,
+sending the hashes from the client to the server, and then calculating (on the
+server) which pieces of the file need to be sent to update the client.  This
+results in a fairly low amount of network traffic, but puts a fairly high CPU
+load on the server.  This would result in a large load being placed on the
+master server if a large number of slaves tried to synchronise with it.
+
+Also, rsync will not reliably update the database in a manner which allows the
+database on a slave to be searched while being updated - therefore, a copy or
+snapshot of the database would need to be taken first to allow searches to
+continue (accessing the copy) while the database is being synchronised.
+
+If a copy is used, the caching problems discussed in the previous section would
+apply again.  If a snapshotting filesystem is used, it may be possible to take
+a read-only snapshot copy cheaply (and without encountering poor caching
+behaviour), but filesystems with support for this are not always available, and
+may require considerable effort to set up even if they are available.
+
+Use a binary diff algorithm
+---------------------------
+
+If a copy of the database on the master before the update was kept, a binary
+diff algorithm (such as "xdelta") could be used to compare the old and new
+versions of the database.  This would produce a patch file which could be
+transferred to the slaves, and then applied - avoiding the need for specific
+calculations to be performed for each slave.
+
+However, this requires a copy or snapshot to be taken on the master - which has
+the same problems as previously discussed.  A copy or snapshot would also need
+to be taken on the slave, since a patch from xdelta couldn't safely be applied
+to a live database.
+
+Serve database from master to slaves over NFS
+---------------------------------------------
+
+NFS allows a section of a filesystem to be exported to a remote host.  Xapian
+is quite capable of searching a database which is exported in such a manner,
+and thus NFS can be used to quickly and easily share a database from the master
+to multiple slaves.
+
+A reasonable setup might be to use a powerful machine with a fast disk as the
+master, and use that same machine as an NFS server.  Then, multiple slaves can
+connect to that NFS server for searching the database. This setup is quite
+convenient, because it separates the indexing workload from the search workload
+to a reasonable extent, but may lead to performance problems.
+
+There are two main problems which are likely to be encountered.  Firstly, in
+order to work efficiently, NFS clients (or the OS filesystem layer above NFS)
+cache information read from the remote file system in memory.  If there is
+insufficient memory available to cache the whole database in memory, searches
+will occasionally need to access parts of the database which are held only on
+the master server.  Such searches will take a long time to complete, because
+the round-trip time for an access to a disk block on the master is typically
+much longer than that for an access to a local disk.  Additionally, if the
+local network experiences problems, or the master server fails (or gets
+overloaded by all the search requests), searches on the slaves will fail to
+complete.
+
+Also, when a file is modified, the NFS protocol has no way of indicating that
+only a small set of blocks in the file have been modified.  The caching is all
+implemented by NFS clients, which can do little other than check the file
+modification time periodically, and invalidate all cached blocks for the file
+if the modification time has changed. For the Linux client, the time between
+checks can be configured by setting the acregmin and acregmax mount options,
+but whatever these are set to, the whole file will be dropped from the cache
+when any modification is found.
+
+This means that, after every update to the database on the master, searches on
+the slaves will have to fetch all the blocks required for their search across
+the network, which will likely result in extremely slow search times until the
+cache on the slaves gets populated properly again.
+
+Use the "remote database backend" facility
+------------------------------------------
+
+Xapian has supported a "remote" database backend since the very early days of
+the project.  This allows a search to be run against a database on a remote
+machine, which may seem to be exactly what we want.  However, the "remote"
+database backend works by performing most of the work for a search on the
+remote end - in the situation we're concerned with, this would mean that most
+of the work was performed on the master, while slaves remain largely idle.
+
+The "remote" database backend is intended to allow a large database to be
+split, at the document level, between multiple hosts.  This allows systems to
+be built which search a very large database with some degree of parallelism
+(and thus provide faster individual searches than a system searching a single
+database locally).  In contrast, the database replication protocol is intended
+to allow a database to be copied to multiple machines to support a high
+concurrent search load (and thus to allow a higher throughput of searches).
+
+In some cases (e.g., a very large database and a high concurrent search load)
+it may be perfectly reasonable to use the database replication protocol in
+conjunction with the "remote" database backend, gaining both of these
+advantages - the two systems solve different problems.
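+
+As an illustration of the distinction, here is a minimal sketch which searches
+two document-level shards as a single database via the "remote" backend (the
+hostnames and port below are placeholders)::
+
+    #include <xapian.h>
+    #include <iostream>
+
+    int main() {
+        // Combine two shards, each served by a remote backend and each
+        // holding a distinct subset of the documents.
+        Xapian::Database db;
+        db.add_database(Xapian::Remote::open("shard0.example.com", 33000));
+        db.add_database(Xapian::Remote::open("shard1.example.com", 33000));
+
+        Xapian::Enquire enquire(db);
+        enquire.set_query(Xapian::Query("example"));
+        Xapian::MSet matches = enquire.get_mset(0, 10);
+        std::cout << matches.get_matches_estimated() << " matches\n";
+        return 0;
+    }
+
+A replicated setup would instead open a complete local copy of the database on
+each search server, so each server can answer whole queries on its own.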
+
Index: docs/Makefile.am
===================================================================
--- docs/Makefile.am	(revision 10008)
+++ docs/Makefile.am	(working copy)
@@ -11,8 +11,9 @@
  bm25.html code_structure.html queryparser.html \
  quickstartexpand.cc.html quickstartindex.cc.html quickstartsearch.cc.html
 
-RSTDOCS = admin_notes.rst deprecation.rst glossary.rst sorting.rst \
- spelling.rst synonyms.rst termgenerator.rst valueranges.rst
+RSTDOCS = admin_notes.rst deprecation.rst glossary.rst \
+ replication.rst replication_protocol.rst \
+ sorting.rst spelling.rst synonyms.rst termgenerator.rst valueranges.rst
 RSTHTML = $(RSTDOCS:.rst=.html)
 
 # Files which should be put in the distribution by automake
Index: bin/xapian-compact.cc
===================================================================
--- bin/xapian-compact.cc	(revision 10008)
+++ bin/xapian-compact.cc	(working copy)
@@ -633,7 +633,7 @@
 		}
 	    }
 	    tmpout.push_back(dest);
-	    tmptab.commit(1);
+	    tmptab.commit(1, -1);
 	}
 	swap(tmp, tmpout);
 	swap(off, newoff);
@@ -941,7 +941,7 @@
 	    }
 
 	    // Commit as revision 1.
-	    out.commit(1);
+	    out.commit(1, -1);
 
 	    cout << '\r' << t->name << ": ";
 	    off_t out_size = 0;
Index: backends/flint/flint_spelling.h
===================================================================
--- backends/flint/flint_spelling.h	(revision 10008)
+++ backends/flint/flint_spelling.h	(working copy)
@@ -99,9 +99,9 @@
 	FlintTable::set_block_size(blocksize);
     }
 
-    void commit(flint_revision_number_t revision) {
+    void commit(flint_revision_number_t revision, int changes_fd) {
 	merge_changes();
-	FlintTable::commit(revision);
+	FlintTable::commit(revision, changes_fd);
     }
 
     void cancel() {
Index: backends/flint/flint_table.cc
===================================================================
--- backends/flint/flint_table.cc	(revision 10008)
+++ backends/flint/flint_table.cc	(working copy)
@@ -1689,7 +1689,7 @@
     base_.set_block_size(block_size_);
     base_.set_have_fakeroot(true);
     base_.set_sequential(true);
-    base_.write_to_file(name + "baseA");
+    base_.write_to_file(name + "baseA", -1);
 
     /* remove the alternative base file, if any */
     sys_unlink_if_exists(name + "baseB");
@@ -1727,7 +1727,7 @@
 }
 
 void
-FlintTable::commit(flint_revision_number_t revision)
+FlintTable::commit(flint_revision_number_t revision, int changes_fd)
 {
     DEBUGCALL(DB, void, "FlintTable::commit", revision);
     Assert(writable);
@@ -1789,7 +1789,7 @@
 	C[i].rewrite = false;
     }
 
-    base.write_to_file(name + "base" + char(base_letter));
+    base.write_to_file(name + "base" + char(base_letter), changes_fd);
     base.commit();
 
     read_root();
@@ -1799,7 +1799,24 @@
     seq_count = SEQ_START_POINT;
 }
 
+
 void
+FlintTable::write_changed_blocks(int changes_fd)
+{
+    Assert(changes_fd >= 0);
+
+    // Compare the old and new bitmaps to find blocks which have changed, and
+    // write them to the file descriptor.
+
+    uint4 n = 0;
+    while (base.find_changed_block(&n)) {
+	// FIXME - write (block number + 1) as a variable length unsigned
+	// integer, followed by the contents of block n.
+	n += 1;
+    }
+}
+
+void
 FlintTable::cancel()
 {
     DEBUGCALL(DB, void, "FlintTable::cancel", "");
Index: backends/flint/flint_table.h
===================================================================
--- backends/flint/flint_table.h	(revision 10008)
+++ backends/flint/flint_table.h	(working copy)
@@ -330,9 +330,18 @@
 	 *          be greater than the latest revision number (see
 	 *          get_latest_revision_number()), or an exception will be
 	 *          thrown.
+	 *
+	 *  @param changes_fd  The file descriptor to write changes to.  If -1,
+	 *          no changes will be written.
 	 */
-	void commit(flint_revision_number_t revision);
+	void commit(flint_revision_number_t revision, int changes_fd);
 
+	/** Append the list of blocks changed to a changeset file.
+	 *
+	 *  @param changes_fd  The file descriptor to write changes to.
+	 */
+	void write_changed_blocks(int changes_fd);
+
 	/** Cancel any outstanding changes.
 	 *
 	 *  This will discard any modifications which haven't been committed
Index: backends/flint/flint_btreebase.h
===================================================================
--- backends/flint/flint_btreebase.h	(revision 10008)
+++ backends/flint/flint_btreebase.h	(working copy)
@@ -86,7 +86,7 @@
 	}
 
 	/** Write the btree base file to disk. */
-	void write_to_file(const std::string &filename);
+	void write_to_file(const std::string &filename, int changes_fd);
 
 	/* Methods dealing with the bitmap */
 	/** true iff block n was free at the start of the transaction on
@@ -98,6 +98,12 @@
 
 	uint4 next_free_block();
 
+	/** Find the first changed block at or after position *n.
+	 *
+	 *  Returns true if such a block was found, or false otherwise.
+	 */
+	bool find_changed_block(uint4 * n);
+
 	bool block_free_now(uint4 n);
 
 	void calculate_last_block();
Index: backends/flint/flint_database.cc
===================================================================
--- backends/flint/flint_database.cc	(revision 10008)
+++ backends/flint/flint_database.cc	(working copy)
@@ -4,7 +4,7 @@
  * Copyright 2001 Hein Ragas
  * Copyright 2002 Ananova Ltd
  * Copyright 2002,2003,2004,2005,2006,2007 Olly Betts
- * Copyright 2006 Richard Boulton
+ * Copyright 2006,2008 Lemur Consulting Ltd
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as
@@ -27,7 +27,11 @@
 #include <xapian/error.h>
 
 #include "safeerrno.h"
+#ifdef __WIN32__
+# include "msvc_posix_wrapper.h"
+#endif
 
+#include "flint_io.h"
 #include "flint_database.h"
 #include "utils.h"
 #include "omdebug.h"
@@ -67,10 +71,34 @@
 // in the term).
 #define MAX_SAFE_TERM_LENGTH 245
 
+// Magic string used to recognise a changeset file.
+#define CHANGES_MAGIC_STRING "FlintChanges"
+
+// The current version of changeset files.
+// 1  - initial implementation
+#define CHANGES_VERSION 1u
+
 // Magic key in the postlist table (which corresponds to an invalid docid) is
 // used to store the next free docid and total length of all documents.
 static const string METAINFO_KEY("", 1);
 
+/** Delete file, throwing an error if we can't delete it (but not if it
+ *  doesn't exist).
+ */
+static void
+sys_unlink_if_exists(const string & filename)
+{
+#ifdef __WIN32__
+    if (msvc_posix_unlink(filename.c_str()) == -1) {
+#else
+    if (unlink(filename.c_str()) == -1) {
+#endif
+	if (errno == ENOENT) return;
+	throw Xapian::DatabaseError("Can't delete file: `" + filename +
+			      "': " + strerror(errno));
+    }
+}
+
 /* This finds the tables, opens them at consistent revisions, manages
  * determining the current and next revision numbers, and stores handles
  * to the tables.
@@ -337,13 +365,81 @@
 FlintDatabase::set_revision_number(flint_revision_number_t new_revision)
 {
     DEBUGCALL(DB, void, "FlintDatabase::set_revision_number", new_revision);
-    postlist_table.commit(new_revision);
-    position_table.commit(new_revision);
-    termlist_table.commit(new_revision);
-    value_table.commit(new_revision);
-    synonym_table.commit(new_revision);
-    spelling_table.commit(new_revision);
-    record_table.commit(new_revision);
+
+    int changes_fd = -1;
+    string changes_name;
+
+    if (1) { // FIXME - toggle writing of changesets somehow.
+	changes_name = db_dir + "/changes" + om_tostring(new_revision);
+#ifdef __WIN32__
+	changes_fd = msvc_posix_open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY);
+#else
+	changes_fd = open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0666);
+#endif
+	if (changes_fd < 0) {
+	    string message = string("Couldn't open changeset ")
+		    + changes_name + " to write: " + strerror(errno);
+	    throw Xapian::DatabaseOpeningError(message);
+	}
+    }
+
+    try {
+	fdcloser closefd(changes_fd);
+	if (changes_fd >= 0) {
+	    string buf;
+	    flint_revision_number_t old_revision =
+		    record_table.get_open_revision_number();
+	    buf += CHANGES_MAGIC_STRING;
+	    buf += pack_uint(CHANGES_VERSION);
+	    buf += pack_uint(old_revision);
+	    buf += pack_uint(new_revision);
+
+	    // FIXME - if DANGEROUS mode is in use, this should contain pack_uint(1u)
+	    buf += pack_uint(0u); // Changes can be applied to a live database.
+
+	    flint_io_write(changes_fd, buf.data(), buf.size());
+
+	    // Write the changes to the blocks in the tables.  Do the postlist
+	    // table last, so that it ends up cached the most if the available
+	    // cache is limited.  Do the position and value tables just
+	    // before that, because they're also critical to search speed.
+	    termlist_table.write_changed_blocks(changes_fd);
+	    synonym_table.write_changed_blocks(changes_fd);
+	    spelling_table.write_changed_blocks(changes_fd);
+	    record_table.write_changed_blocks(changes_fd);
+	    position_table.write_changed_blocks(changes_fd);
+	    value_table.write_changed_blocks(changes_fd);
+	    postlist_table.write_changed_blocks(changes_fd);
+	}
+
+	postlist_table.commit(new_revision, changes_fd);
+	position_table.commit(new_revision, changes_fd);
+	termlist_table.commit(new_revision, changes_fd);
+	value_table.commit(new_revision, changes_fd);
+	synonym_table.commit(new_revision, changes_fd);
+	spelling_table.commit(new_revision, changes_fd);
+	record_table.commit(new_revision, changes_fd);
+
+	if (changes_fd >= 0) {
+	    string buf;
+	    buf += '\0';
+	    buf += pack_uint(new_revision);
+	    flint_io_write(changes_fd, buf.data(), buf.size());
+	    // FIXME - should really be calling flint_io_sync() on the changes
+	    // file before writing the new base for the record table, to ensure
+	    // that the changes file doesn't get lost (or, worse, left only
+	    // partially recorded on disk) if the system fails after the sync
+	    // of the record table, but before the sync of the changes file.
+	    flint_io_sync(changes_fd);
+	}
+    } catch(...) {
+	// Remove the changeset, if there was one.
+	if (changes_fd >= 0) {
+	    sys_unlink_if_exists(changes_name);
+	}
+
+	throw;
+    }
 }
 
 void
Index: backends/flint/flint_synonym.h
===================================================================
--- backends/flint/flint_synonym.h	(revision 10008)
+++ backends/flint/flint_synonym.h	(working copy)
@@ -102,9 +102,9 @@
 	FlintTable::set_block_size(blocksize);
     }
 
-    void commit(flint_revision_number_t revision) {
+    void commit(flint_revision_number_t revision, int changes_fd) {
 	merge_changes();
-	FlintTable::commit(revision);
+	FlintTable::commit(revision, changes_fd);
     }
 
     void cancel() {
Index: backends/flint/flint_btreebase.cc
===================================================================
--- backends/flint/flint_btreebase.cc	(revision 10009)
+++ backends/flint/flint_btreebase.cc	(working copy)
@@ -277,7 +277,7 @@
 }
 
 void
-FlintTable_base::write_to_file(const string &filename)
+FlintTable_base::write_to_file(const string &filename, int changes_fd)
 {
     calculate_last_block();
 
@@ -310,6 +310,16 @@
     }
     fdcloser closefd(h);
 
+    if (changes_fd >= 0) {
+	string changesbuf;
+	changesbuf += pack_uint(1u); // Indicates that this item is a base file.
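+	// FIXME - the changeset format described in
+	// docs/replication_protocol.rst also calls for the table name here.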
+	changesbuf += filename[filename.size() - 1]; // The letter of the base file.
+	changesbuf += pack_uint(buf.size());
+	flint_io_write(changes_fd, changesbuf.data(), changesbuf.size());
+	flint_io_write(changes_fd, buf.data(), buf.size());
+    }
+
     flint_io_write(h, buf.data(), buf.size());
     flint_io_sync(h);
 }
@@ -412,7 +421,18 @@
     return n;
 }
 
+
 bool
+FlintTable_base::find_changed_block(uint4 * n)
+{
+    // Search for a block which was free at the start of the transaction, but
+    // isn't now.
+    (void) n;
+    // FIXME - implement
+    return false;
+}
+
+bool
 FlintTable_base::block_free_now(uint4 n)
 {
     uint4 i = n / CHAR_BIT;
