Ticket #227: patch

File patch, 25.6 KB (added by Richard Boulton, 16 years ago)

Work in progress patch

  • docs/replication_protocol.rst

     
     1.. Copyright (C) 2008 Lemur Consulting Ltd
     2
     3====================================
     4Xapian Database Replication Protocol
     5====================================
     6
     7.. contents:: Table of contents
     8
     9This document contains details of the implementation of the replication
     10protocol, version 1.  For details of how and why to use the replication
     11protocol, see the separate `Replication Users Guide <replication.html>`_
     12document.
     13
     14Protocol description
     15====================
     16
     17FIXME - describe the protocol used to transfer the updates.
     18
     19
     20Changeset files
     21===============
     22
     23Changes are represented by changeset files.  When changeset logging is enabled
     24for a flint database, just before each commit a changeset file is created in
     25the database directory.  This file contains a record of the changes made,
     26currently in the following format (but note that this format may change between
     27implementations of flint):
     28
     29 - 12 bytes holding the string "FlintChanges" (used to check that a file is a
     30   changeset file).
     31
     32 - The format of the changeset (as a variable length unsigned integer).
     33
     34 - The revision number of the database before the changes were applied (as a
     35   variable length unsigned integer).
     36
     37 - The revision number of the database after the changes were applied (as a
     38   variable length unsigned integer).
     39
     40 - A byte:
     41
     42   - 0 if the changes can safely be applied to a live database
     43   
     44   - 1 if the changes cannot be applied while searching is in progress.  (This
     45     will be set if the database was modified in "DANGEROUS" mode).
     46
     47   - 2 if the changes contain a whole database copy (or at least, a copy of all
     48     active blocks in a database), in which case the changes should be used to
     49     make a brand new database.
     50
     51 - A series of items:
     52
     53   - A byte: 0 if there are no more items in the changeset, 1 if the next item
     54     is a base file, 2 if the next item is a list of blocks.
     55
     56   - A string holding a table name (preceded by the length of the string as
     57     a variable length unsigned integer).
     58
     59   - If a base file:
     60
     61     - The letter of the base file (currently 'A' or 'B').
     62
     63     - The length of the file (as a variable length unsigned integer).
     64
     65     - The contents of the base file.
     66
     67   - If a list of blocks:
     68
     69     - The blocksize in use.
     70
     71     - A list of items:
     72
     73       - A variable length unsigned integer holding 0 if the list is at an end,
     74         or holding (block number + 1) otherwise.
     75
     76       - The contents of the block.
     77
     78 - A revision number that the database must be upgraded to, with more
     79   changesets, before it is safe to be made live.  This will normally be the
     80   revision number of the database after the changes were applied, but if the
     81   changeset was created by copying parts of the database without a read lock,
     82   and modifications were made to the database while the copy was in progress,
     83   parts of the copy may contain later revisions than intended - in this
     84   situation further changesets will be needed to ensure that these parts of the
     85   database are fully integrated.
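The header framing described above can be illustrated in code. The sketch below is not the actual flint implementation: it assumes a 7-bit, least-significant-first varint encoding for the "variable length unsigned integer" fields (the real on-disk encoding is defined by flint's pack_uint() and may differ), and the function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode a variable-length unsigned integer: 7 bits per byte, least
// significant first, high bit set on every byte except the last.
// (Assumed encoding - see the lead-in above.)
bool unpack_uint(const char **p, const char *end, uint64_t *result) {
    uint64_t value = 0;
    int shift = 0;
    while (*p != end) {
        unsigned char byte = static_cast<unsigned char>(*(*p)++);
        value |= uint64_t(byte & 0x7f) << shift;
        if (!(byte & 0x80)) { *result = value; return true; }
        shift += 7;
    }
    return false; // Ran out of data mid-integer.
}

// Check the 12-byte magic string and read the changeset format version
// and the before/after revision numbers from the start of a changeset
// file held in `data`.
bool parse_changeset_header(const std::string &data,
                            uint64_t *format,
                            uint64_t *rev_before, uint64_t *rev_after) {
    const std::string magic = "FlintChanges";
    if (data.compare(0, magic.size(), magic) != 0) return false;
    const char *p = data.data() + magic.size();
    const char *end = data.data() + data.size();
    return unpack_uint(&p, end, format) &&
           unpack_uint(&p, end, rev_before) &&
           unpack_uint(&p, end, rev_after);
}
```

This only covers the fixed header; the "danger level" byte and the item list that follow would be read the same way.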
  • docs/replication.rst

     
     1.. Copyright (C) 2008 Lemur Consulting Ltd
     2
     3=======================================
     4Xapian Database Replication Users Guide
     5=======================================
     6
     7.. contents:: Table of contents
     8
     9Introduction
     10============
     11
     12It is often desirable to maintain multiple copies of a Xapian database, having
     13a "master" database which modifications are made on, and a set of secondary
     14(read-only, "slave") databases which these modifications propagate to.  For
     15example, to support a high query load there may be many search servers, each
     16with a local copy of the database, and a single indexing server.  In order to
     17allow scaling to a large number of search servers, with large databases and
     18frequent updates, we need a database replication implementation to have the
     19following characteristics:
     20
     21 - Data transfer is (at most) proportional to the size of the updates, rather
     22   than the size of the database, to allow frequent small updates to large
     23   databases to be replicated efficiently.
     24
     25 - Searching (on the slave databases) and indexing (on the master database) can
     26   continue during synchronisation.
     27
     28 - Data cached (in memory) on the slave databases is not discarded (unless it's
     29   actually out of date) as updates arrive, to ensure that searches continue to
     30   be performed quickly during and after updates.
     31
     32 - Synchronising each slave database involves low overhead (both IO and CPU) on
     33   the server holding the master database, so that many slaves can be updated
     34   from a single master without overloading it.
     35
     36 - Database synchronisation can be recovered after network outages or server
     37   failures without manual intervention and without excessive data transfer.
     38
     39The database replication protocol is intended to support replicating a single
     40writable database to multiple (read-only) search servers, while satisfying all
     41of the above properties.  It is not intended to support replication of multiple
     42writable databases - there must always be a single master database to which all
     43modifications are made.
     44
     45This document gives an overview of how and why to use the replication protocol.
     46For technical details of the implementation of the replication protocol, see
     47the separate `Replication Protocol <replication_protocol.html>`_ document.
     48
     49Setting up replicated databases
     50===============================
     51
     52FIXME - describe how to set up a set of replicated databases.
     53
     54
     55Alternative approaches
     56======================
     57
     58Without using the database replication protocol, there are various ways in
     59which the "single master, multiple slaves" setup could be implemented.
     60
     61 - Copy database from master to all slaves after each update, then swap the new
     62   database for the old.
     63
     64 - Synchronise databases from the master to the slaves using rsync.
     65
     66 - Keep copy of database on master from before each update, and use a binary
     67   diff algorithm (e.g., xdelta) to calculate the changes, and then apply these
     68   same changes to the databases on each slave.
     69
     70 - Serve database from master to slaves over NFS (or other remote file system).
     71
     72 - Use the "remote database backend" facility of Xapian to allow slave servers
     73   to search the database directly on the master.
     74
     75All of these could be made to work but have various drawbacks, and fail to
     76satisfy all the desired characteristics.  Let's examine them in detail:
     77
     78Copying database after each update
     79----------------------------------
     80
     81Databases could be pushed to the slaves after each update simply by copying the
     82entire database from the master (using scp, ftp, http or one of the many other
     83transfer options).  After the copy is completed, the new database would be made
     84live (perhaps by symlink switching, if symlinks are available).  After a
     85reasonable interval to allow searches in progress on the old database to
     86complete, the old database would be removed.  (On UNIX filesystems, the old
     87database could be unlinked immediately, and the resources used by it would be
     88automatically freed as soon as the current searches using it complete.)
     89
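The symlink switching mentioned above can be made atomic on POSIX systems by creating the new link under a temporary name and then rename()ing it over the live one, since rename() atomically replaces an existing link. A minimal sketch (the function and path names are illustrative, not part of Xapian):

```cpp
#include <cassert>
#include <cstdio>     // rename()
#include <string>
#include <unistd.h>   // symlink(), unlink(), readlink()

// Point the "live" symlink at a freshly copied database directory.
// rename() replaces the old link atomically, so searchers always see
// either the old database or the new one, never a missing link.
bool switch_live_symlink(const std::string &new_db_dir,
                         const std::string &live_link) {
    std::string tmp_link = live_link + ".new";
    (void)unlink(tmp_link.c_str());  // Remove any stale temporary link.
    if (symlink(new_db_dir.c_str(), tmp_link.c_str()) == -1)
        return false;
    if (rename(tmp_link.c_str(), live_link.c_str()) == -1) {
        unlink(tmp_link.c_str());
        return false;
    }
    return true;
}
```

The old database directory can then be removed once searches using it have finished, as described above.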
     90This approach has the advantage of simplicity, and also ensures that the
     91databases can be correctly re-synchronised after network outages or hardware
     92failure.
     93
     94However, this approach would involve copying a large amount of data for each
     95update, however small the update was.  Also, because the search server would
     96have to switch to access new files each time an update was pushed, the search
     97server will be likely to experience poor performance due to commonly accessed
     98pages falling out of the disk cache during the update.  In particular, although
     99some of the newly pushed data would be likely to be in the cache immediately
     100after the update, if the combination of the old and new database sizes exceeds
     101the size of the memory available on the search servers for caching, either some
     102of the live database will be dropped from the cache resulting in poor
     103performance during the update, or some of the new database will not initially
     104be present in the cache after update.
     105
     106Synchronise database using rsync
     107--------------------------------
     108
     109Rsync works by calculating hashes for the content on the client and the server,
     110sending the hashes from the client to the server, and then calculating (on the
     111server) which pieces of the file need to be sent to update the client.  This
     112results in a fairly low amount of network traffic, but puts a fairly high CPU
     113load on the server.  This would result in a large load being placed on the
     114master server if a large number of slaves tried to synchronise with it.
     115
     116Also, rsync will not reliably update the database in a manner which allows the
     117database on a slave to be searched while being updated - therefore, a copy or
     118snapshot of the database would need to be taken first to allow searches to
     119continue (accessing the copy) while the database is being synchronised.
     120
     121If a copy is used, the caching problems discussed in the previous section would
     122apply again.  If a snapshotting filesystem is used, it may be possible to take
     123a read-only snapshot copy cheaply (and without encountering poor caching
     124behaviour), but filesystems with support for this are not always available, and
     125may require considerable effort to set up even if they are available.
     126
     127Use a binary diff algorithm
     128---------------------------
     129
     130If a copy of the database on the master before the update was kept, a binary
     131diff algorithm (such as "xdelta") could be used to compare the old and new
     132versions of the database.  This would produce a patch file which could be
     133transferred to the slaves, and then applied - avoiding the need for specific
     134calculations to be performed for each slave.
     135
     136However, this requires a copy or snapshot to be taken on the master - which has
     137the same problems as previously discussed.  A copy or snapshot would also need
     138to be taken on the slave, since a patch from xdelta couldn't safely be applied
     139to a live database.
     140
     141Serve database from master to slaves over NFS
     142---------------------------------------------
     143
     144NFS allows a section of a filesystem to be exported to a remote host.  Xapian
     145is quite capable of searching a database which is exported in such a manner,
     146and thus NFS can be used to quickly and easily share a database from the master
     147to multiple slaves.
     148
     149A reasonable setup might be to use a powerful machine with a fast disk as the
     150master, and use that same machine as an NFS server.  Then, multiple slaves can
     151connect to that NFS server for searching the database. This setup is quite
     152convenient, because it separates the indexing workload from the search workload
     153to a reasonable extent, but may lead to performance problems.
     154
     155There are two main problems which are likely to be encountered.  Firstly, in
     156order to work efficiently, NFS clients (or the OS filesystem layer above NFS)
     157cache information read from the remote file system in memory.  If there is
     158insufficient memory available to cache the whole database in memory, searches
     159will occasionally need to access parts of the database which are held only on
     160the master server.  Such searches will take a long time to complete, because
     161the round-trip time for an access to a disk block on the master is typically a
     162lot slower than the round-trip time for access to a local disk.  Additionally,
     163if the local network experiences problems, or the master server fails (or gets
     164overloaded due to all the search requests), the searches will not be able to
     165complete.
     166
     167Also, when a file is modified, the NFS protocol has no way of indicating that
     168only a small set of blocks in the file have been modified.  The caching is all
     169implemented by NFS clients, which can do little other than check the file
     170modification time periodically, and invalidate all cached blocks for the file
     171if the modification time has changed. For the Linux client, the time between
     172checks can be configured by setting the acregmin and acregmax mount options,
     173but whatever these are set to, the whole file will be dropped from the cache
     174when any modification is found.
     175
     176This means that, after every update to the database on the master, searches on
     177the slaves will have to fetch all the blocks required for their search across
     178the network, which will likely result in extremely slow search times until the
     179cache on the slaves gets populated properly again.
     180
     181Use the "remote database backend" facility
     182------------------------------------------
     183
     184Xapian has supported a "remote" database backend since the very early days of
     185the project.  This allows a search to be run against a database on a remote
     186machine, which may seem to be exactly what we want.  However, the "remote"
     187database backend works by performing most of the work for a search on the
     188remote end - in the situation we're concerned with, this would mean that most
     189of the work was performed on the master, while slaves remain largely idle.
     190
     191The "remote" database backend is intended to allow a large database to be
     192split, at the document level, between multiple hosts.  This allows systems to
     193be built which search a very large database with some degree of parallelism
     194(and thus provide faster individual searches than a system searching a single
     195database locally).  In contrast, the database replication protocol is intended
     196to allow a database to be copied to multiple machines to support a high
     197concurrent search load (and thus to allow a higher throughput of searches).
     198
     199In some cases (i.e., a very large database and a high concurrent search load)
     200it may be perfectly reasonable to use the database replication protocol in
     201conjunction with the "remote" database backend to get both of these advantages
     202- the two systems solve different problems.
     203
  • docs/Makefile.am

     
    1111 bm25.html code_structure.html queryparser.html \
    1212 quickstartexpand.cc.html quickstartindex.cc.html quickstartsearch.cc.html
    1313
    14 RSTDOCS = admin_notes.rst deprecation.rst glossary.rst sorting.rst \
    15  spelling.rst synonyms.rst termgenerator.rst valueranges.rst
     14RSTDOCS = admin_notes.rst deprecation.rst glossary.rst \
     15 replication.rst replication_protocol.rst \
     16 sorting.rst spelling.rst synonyms.rst termgenerator.rst valueranges.rst
    1617RSTHTML = $(RSTDOCS:.rst=.html)
    1718
    1819# Files which should be put in the distribution by automake
  • bin/xapian-compact.cc

     
    633633                }
    634634            }
    635635            tmpout.push_back(dest);
    636             tmptab.commit(1);
     636            tmptab.commit(1, -1);
    637637        }
    638638        swap(tmp, tmpout);
    639639        swap(off, newoff);
     
    941941            }
    942942
    943943            // Commit as revision 1.
    944             out.commit(1);
     944            out.commit(1, -1);
    945945
    946946            cout << '\r' << t->name << ": ";
    947947            off_t out_size = 0;
  • backends/flint/flint_spelling.h

     
    9999        FlintTable::set_block_size(blocksize);
    100100    }
    101101
    102     void commit(flint_revision_number_t revision) {
     102    void commit(flint_revision_number_t revision, int changes_fd) {
    103103        merge_changes();
    104         FlintTable::commit(revision);
     104        FlintTable::commit(revision, changes_fd);
    105105    }
    106106
    107107    void cancel() {
  • backends/flint/flint_table.cc

     
    16891689    base_.set_block_size(block_size_);
    16901690    base_.set_have_fakeroot(true);
    16911691    base_.set_sequential(true);
    1692     base_.write_to_file(name + "baseA");
     1692    base_.write_to_file(name + "baseA", -1);
    16931693
    16941694    /* remove the alternative base file, if any */
    16951695    sys_unlink_if_exists(name + "baseB");
     
    17271727}
    17281728
    17291729void
    1730 FlintTable::commit(flint_revision_number_t revision)
     1730FlintTable::commit(flint_revision_number_t revision, int changes_fd)
    17311731{
    17321732    DEBUGCALL(DB, void, "FlintTable::commit", revision);
    17331733    Assert(writable);
     
    17891789        C[i].rewrite = false;
    17901790    }
    17911791
    1792     base.write_to_file(name + "base" + char(base_letter));
     1792    base.write_to_file(name + "base" + char(base_letter), changes_fd);
    17931793    base.commit();
    17941794
    17951795    read_root();
     
    17991799    seq_count = SEQ_START_POINT;
    18001800}
    18011801
     1802
    18021803void
     1804FlintTable::write_changed_blocks(int changes_fd)
     1805{
     1806    Assert(changes_fd >= 0);
     1807
     1808    // Compare the old and new bitmaps to find blocks which have changed, and
     1809    // write them to the file descriptor.
     1810
     1811    uint4 n = 0;
     1812    while (base.find_changed_block(&n)) {
     1813        // Write block n to the file.
     1814       
     1815        n += 1;
     1816    }
     1817}
     1818
     1819void
    18031820FlintTable::cancel()
    18041821{
    18051822    DEBUGCALL(DB, void, "FlintTable::cancel", "");
  • backends/flint/flint_table.h

     
    330330         *          be greater than the latest revision number (see
    331331         *          get_latest_revision_number()), or an exception will be
    332332         *          thrown.
     333         *
     334         *  @param changes_fd  The file descriptor to write changes to.  If -1,
     335         *          no changes will be written.
    333336         */
    334         void commit(flint_revision_number_t revision);
     337        void commit(flint_revision_number_t revision, int changes_fd);
    335338
     339        /** Append the list of blocks changed to a changeset file.
     340         *
     341         *  @param changes_fd  The file descriptor to write changes to.
     342         */
     343        void write_changed_blocks(int changes_fd);
     344
    336345        /** Cancel any outstanding changes.
    337346         *
    338347         *  This will discard any modifications which haven't been committed
  • backends/flint/flint_btreebase.h

     
    8686        }
    8787
    8888        /** Write the btree base file to disk. */
    89         void write_to_file(const std::string &filename);
     89        void write_to_file(const std::string &filename, int changes_fd);
    9090
    9191        /* Methods dealing with the bitmap */
    9292        /** true iff block n was free at the start of the transaction on
     
    9898
    9999        uint4 next_free_block();
    100100
     101        /** Find the first changed block at or after position *n.
     102         *
     103         *  Returns true if such a block was found, or false otherwise.
     104         */
     105        bool find_changed_block(uint4 * n);
     106
    101107        bool block_free_now(uint4 n);
    102108
    103109        void calculate_last_block();
  • backends/flint/flint_database.cc

     
    44 * Copyright 2001 Hein Ragas
    55 * Copyright 2002 Ananova Ltd
    66 * Copyright 2002,2003,2004,2005,2006,2007 Olly Betts
    7  * Copyright 2006 Richard Boulton
     7 * Copyright 2006,2008 Lemur Consulting Ltd
    88 *
    99 * This program is free software; you can redistribute it and/or
    1010 * modify it under the terms of the GNU General Public License as
     
    2727#include <xapian/error.h>
    2828
    2929#include "safeerrno.h"
     30#ifdef __WIN32__
     31# include "msvc_posix_wrapper.h"
     32#endif
    3033
     34#include "flint_io.h"
    3135#include "flint_database.h"
    3236#include "utils.h"
    3337#include "omdebug.h"
     
    6771// in the term).
    6872#define MAX_SAFE_TERM_LENGTH 245
    6973
     74// Magic string used to recognise a changeset file.
     75#define CHANGES_MAGIC_STRING "FlintChanges"
     76
     77// The current version of changeset files.
     78// 1  - initial implementation
     79#define CHANGES_VERSION 1u
     80
    7081// Magic key in the postlist table (which corresponds to an invalid docid) is
    7182// used to store the next free docid and total length of all documents.
    7283static const string METAINFO_KEY("", 1);
    7384
     85/** Delete file, throwing an error if we can't delete it (but not if it
     86 *  doesn't exist).
     87 */
     88static void
     89sys_unlink_if_exists(const string & filename)
     90{
     91#ifdef __WIN32__
     92    if (msvc_posix_unlink(filename.c_str()) == -1) {
     93#else
     94    if (unlink(filename.c_str()) == -1) {
     95#endif
     96        if (errno == ENOENT) return;
     97        throw Xapian::DatabaseError("Can't delete file: `" + filename +
     98                              "': " + strerror(errno));
     99    }
     100}
     101
    74102/* This finds the tables, opens them at consistent revisions, manages
    75103 * determining the current and next revision numbers, and stores handles
    76104 * to the tables.
     
    337365FlintDatabase::set_revision_number(flint_revision_number_t new_revision)
    338366{
    339367    DEBUGCALL(DB, void, "FlintDatabase::set_revision_number", new_revision);
    340     postlist_table.commit(new_revision);
    341     position_table.commit(new_revision);
    342     termlist_table.commit(new_revision);
    343     value_table.commit(new_revision);
    344     synonym_table.commit(new_revision);
    345     spelling_table.commit(new_revision);
    346     record_table.commit(new_revision);
     368
     369    int changes_fd = -1;
     370    string changes_name;
     371
     372    if (1) { // FIXME - toggle writing of changesets somehow.
     373        changes_name = db_dir + "/changes" + om_tostring(new_revision);
     374#ifdef __WIN32__
     375        changes_fd = msvc_posix_open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY);
     376#else
     377        changes_fd = open(changes_name.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0666);
     378#endif
     379        if (changes_fd < 0) {
     380            string message = string("Couldn't open changeset ")
     381                    + changes_name + " to write: " + strerror(errno);
     382            throw Xapian::DatabaseOpeningError(message);
     383        }
     384    }
     385
     386    try {
     387        fdcloser closefd(changes_fd);
     388        if (changes_fd >= 0) {
     389            string buf;
     390            flint_revision_number_t old_revision =
     391                    record_table.get_open_revision_number();
     392            buf += CHANGES_MAGIC_STRING;
     393            buf += pack_uint(CHANGES_VERSION);
     394            buf += pack_uint(old_revision);
     395            buf += pack_uint(new_revision);
     396
     397            // FIXME - if DANGEROUS mode is in use, this should contain pack_uint(1u)
     398            buf += pack_uint(0u); // Changes can be applied to a live database.
     399
     400            flint_io_write(changes_fd, buf.data(), buf.size());
     401
     402            // Write the changes to the blocks in the tables.  Do the postlist
     403            // table last, so that ends up cached the most, if the cache
     404            // available is limited.  Do the position and value tables just
     405            // before that, because they're also critical to search speed.
     406            termlist_table.write_changed_blocks(changes_fd);
     407            synonym_table.write_changed_blocks(changes_fd);
     408            spelling_table.write_changed_blocks(changes_fd);
     409            record_table.write_changed_blocks(changes_fd);
     410            position_table.write_changed_blocks(changes_fd);
     411            value_table.write_changed_blocks(changes_fd);
     412            postlist_table.write_changed_blocks(changes_fd);
     413        }
     414
     415        postlist_table.commit(new_revision, changes_fd);
     416        position_table.commit(new_revision, changes_fd);
     417        termlist_table.commit(new_revision, changes_fd);
     418        value_table.commit(new_revision, changes_fd);
     419        synonym_table.commit(new_revision, changes_fd);
     420        spelling_table.commit(new_revision, changes_fd);
     421        record_table.commit(new_revision, changes_fd);
     422
     423        if (changes_fd >= 0) {
     424            string buf;
     425            buf += '\0';
     426            buf += pack_uint(new_revision);
     427            flint_io_write(changes_fd, buf.data(), buf.size());
     428            // FIXME - should really be calling flint_io_sync() on the changes
     429            // file before writing the new base for the record table, to ensure
     430            // that the changes file doesn't get lost (or, worse, left only
     431            // partially recorded on disk) if the system fails after the sync
     432            // of the record table, but before the sync of the changes file.
     433            flint_io_sync(changes_fd);
     434        }
     435    } catch(...) {
     436        // Remove the changeset, if there was one.
     437        if (changes_fd >= 0) {
     438            sys_unlink_if_exists(changes_name);
     439        }
     440
     441        throw;
     442    }
    347443}
    348444
    349445void
  • backends/flint/flint_synonym.h

     
    102102        FlintTable::set_block_size(blocksize);
    103103    }
    104104
    105     void commit(flint_revision_number_t revision) {
     105    void commit(flint_revision_number_t revision, int changes_fd) {
    106106        merge_changes();
    107         FlintTable::commit(revision);
     107        FlintTable::commit(revision, changes_fd);
    108108    }
    109109
    110110    void cancel() {
  • backends/flint/flint_btreebase.cc

     
    277277}
    278278
    279279void
    280 FlintTable_base::write_to_file(const string &filename)
     280FlintTable_base::write_to_file(const string &filename, int changes_fd)
    281281{
    282282    calculate_last_block();
    283283
     
    310310    }
    311311    fdcloser closefd(h);
    312312
     313    if (changes_fd >= 0) {
     314        string changesbuf;
     315        changesbuf += pack_uint(1u); // Indicates that this item is a base file.
     316        changesbuf += filename[filename.size() - 1]; // The letter of the base file.
     317        changesbuf += pack_uint(buf.size());
     318        flint_io_write(changes_fd, changesbuf.data(), changesbuf.size());
     319        flint_io_write(changes_fd, buf.data(), buf.size());
     320    }
     321
    313322    flint_io_write(h, buf.data(), buf.size());
    314323    flint_io_sync(h);
    315324}
     
    412421    return n;
    413422}
    414423
     424
    415425bool
     426FlintTable_base::find_changed_block(uint4 * n)
     427{
     428    // Search for a block which was free at the start of the transaction, but
     429    // isn't now.
     430    (void) n;
     431    // FIXME - implement
     432    return false;
     433}
     434
     435bool
    416436FlintTable_base::block_free_now(uint4 n)
    417437{
    418438    uint4 i = n / CHAR_BIT;