I have a setup where I would like to be able to perform index updates on one
master database, and then replicate this database to multiple client machines
for searching.
I've experimented with an NFS setup, with the database kept local on the index
server and mounted remotely on the search clients, hoping that the client
machines would keep enough of the database cached that network traffic would
not slow down searches too much. However, this method doesn't work
satisfactorily: the only way the NFS protocol lets a client detect file updates
is by polling a file's mtime, so whenever the index is updated, all cached
pages from the database are discarded. This leads to many very slow searches.
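To illustrate the problem (this is not part of the setup, and the path is
hypothetical): the only freshness signal available to an NFS client is
per-file, so even a tiny commit invalidates every cached page of the file. A
rough user-space analogue of the client's cache validation looks like this:

    import os, time

    def watch_mtime(path, interval=1.0):
        # The only freshness signal is the file's mtime, and it is
        # per-file, so any change forces all cached pages of the file
        # to be treated as stale.
        last = os.stat(path).st_mtime
        while True:
            time.sleep(interval)
            now = os.stat(path).st_mtime
            if now != last:
                print(path, "changed: all cached pages are stale")
                last = now

    # Hypothetical path to one of the flint database's btree files.
    watch_mtime("/mnt/index/db/postlist.DB")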
For now, I'm looking at setting up a system to take snapshots of the database
using filesystem features (e.g., the snapshot functionality provided by ZFS),
use xdelta to calculate the differences between successive snapshots, transfer
those differences manually, and apply them to the databases on the search
machines.
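A sketch of that pipeline, to make the steps concrete. The ZFS dataset name,
snapshot names, and paths are illustrative, I'm assuming xdelta 1.x's
delta/patch subcommands, and only one of the database's btree files is shown:

    import subprocess

    DATASET = "tank/index"      # hypothetical ZFS dataset holding the database
    OLD, NEW = "rep1", "rep2"   # previous and current snapshot names

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # 1. Take an atomic snapshot of the live database.
    run("zfs", "snapshot", f"{DATASET}@{NEW}")

    # 2. Compute a binary diff between the two snapshots. Note that
    #    xdelta has to read both versions in full - this is the
    #    whole-database traversal objected to below.
    run("xdelta", "delta",
        f"/{DATASET}/.zfs/snapshot/{OLD}/postlist.DB",
        f"/{DATASET}/.zfs/snapshot/{NEW}/postlist.DB",
        "/tmp/postlist.xdelta")

    # 3. Transfer /tmp/postlist.xdelta to each search machine, then apply:
    #      xdelta patch /tmp/postlist.xdelta postlist.DB postlist.DB.new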
However, this approach has two major drawbacks. First, it depends on
filesystem-specific features to take the snapshots (a standard file copy could
be used instead, but that would have poor cache performance, which is exactly
what we're trying to avoid). Second, it requires the whole database to be
traversed on the index machine to calculate the binary diffs, which imposes
unnecessary load on the index machine.
Instead, I would like to have a hook into flint which writes out a list of the
modified btree pages, so that these can then be distributed to the search
servers. If this information were written to a log file, together with the
points at which fsync was called and with details of the changes made to the
base files, the log file could be transferred to the search machines and
replayed there with minimal work.
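A minimal sketch of the replay side, assuming a hypothetical log format (none
of this exists in flint yet, and the base-file handling is simplified): each
record starts with a tag byte, where 'P' carries a page number and the new
page contents, 'S' marks a point at which the indexer called fsync, and 'B'
carries a replacement base file:

    import os, struct

    BLOCK_SIZE = 8192  # flint's default btree block size

    def replay(log_path, db_path):
        # Apply a changeset log to a replica: write each modified
        # btree page at its offset, and fsync at the same points the
        # indexer did, so the replica passes through the same sequence
        # of consistent on-disk states.
        db = os.open(db_path, os.O_WRONLY)
        with open(log_path, "rb") as log:
            while tag := log.read(1):
                if tag == b"P":
                    (page_no,) = struct.unpack("<Q", log.read(8))
                    os.pwrite(db, log.read(BLOCK_SIZE),
                              page_no * BLOCK_SIZE)
                elif tag == b"S":
                    os.fsync(db)
                elif tag == b"B":
                    (size,) = struct.unpack("<I", log.read(4))
                    # Writing the base file last publishes the new
                    # revision atomically to readers.
                    with open(db_path + ".base", "wb") as f:
                        f.write(log.read(size))
        os.close(db)

Ordering matters here: all page writes for a revision must be durable before
the base file naming that revision is written, which is why the log needs to
record the points at which fsync was called.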
Work in progress patch