Opened 16 years ago

Last modified 21 months ago

#333 assigned enhancement

Keep track of last modification time of database

Reported by: Richard Boulton
Owned by: Vaibhav Kansagara
Priority: normal
Milestone: 2.0.0
Component: Backend-Glass
Version: git master
Severity: normal
Keywords: GoodFirstBug
Cc: vaibhavkansagara249@…
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

It would be useful to be able to find out the time at which a modification was last made to a database. Two possible uses for this information:

  • supporting caching in an HTTP search server: for example, this would allow if-modified-since requests to be satisfied without having to perform a search (when not modified since).
  • displaying status information about a database to users.
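As a rough illustration of the first use case, here is a minimal sketch of how a search server might decide whether a 304 Not Modified response is possible. The function name and the idea of passing the last commit time as epoch seconds are assumptions made for this example; nothing here is an existing Xapian API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def not_modified(if_modified_since, last_commit_epoch):
    """Return True if a 304 Not Modified response would be appropriate.

    if_modified_since: raw If-Modified-Since header value, or None if absent.
    last_commit_epoch: last commit time in seconds since the Unix epoch,
                       or None if the database cannot report one.
    """
    if if_modified_since is None or last_commit_epoch is None:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # Unparseable header: fall back to running the search.
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop fractional seconds.
    last = datetime.fromtimestamp(int(last_commit_epoch), tz=timezone.utc)
    return last <= since
```

When the function returns False, the server would run the search as usual and send the last commit time in a Last-Modified header.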

To implement this, we'd need to store the modification time in the database somewhere, since we want to know the time of the last successful commit; this can't be derived by looking at the timestamp of the files, since it's possible that a commit nearly happened, but was cancelled or failed at the last minute (or, indeed, since we use fdatasync(), that a commit succeeded, but the timestamp information about the file modification never got written to disk due to a machine crash immediately after the commit). [This is true for chert and older backends, but for glass a new iamglass file is written to a temporary name and then renamed to atomically perform the commit, so its timestamp should be reliable even in the face of these concerns]
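The write-to-a-temporary-name-then-rename pattern mentioned for glass can be sketched generically as follows. This is not Xapian's code, just an illustration of why the final file's timestamp only changes when a commit actually completes; `atomic_write` is a hypothetical helper.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # Make sure the contents reach disk first.
        os.replace(tmp_path, path)  # Atomic rename on POSIX filesystems.
    except BaseException:
        # The "commit" failed: remove the temporary file and leave path,
        # including its modification time, untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A failure before the rename leaves the previous file and its timestamp in place, which is the property that makes the timestamp usable as a last-commit time.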

This could be emulated by storing the current time in a metadata value each time that flush() is called, but this wouldn't cover modifications made by implicit flushes. Currently, to implement this correctly on top of Xapian, you'd need to use explicit transactions.
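A minimal sketch of that emulation, assuming a `db` object with the transaction and metadata methods of Xapian's WritableDatabase (`begin_transaction()`, `commit_transaction()`, `cancel_transaction()`, `set_metadata()`, `get_metadata()`). The key name and both helper functions are hypothetical, chosen for this example only.

```python
import time

# Hypothetical key name; not a Xapian convention.
LAST_MODIFIED_KEY = "last_modified"


def commit_with_timestamp(db, apply_changes, now=None):
    """Apply changes and record the time inside one explicit transaction.

    Wrapping both in a transaction means the changes and the timestamp
    are committed together, so an implicit flush cannot commit one
    without the other.
    """
    db.begin_transaction()
    try:
        apply_changes(db)
        stamp = int(time.time() if now is None else now)
        db.set_metadata(LAST_MODIFIED_KEY, str(stamp))
        db.commit_transaction()
    except Exception:
        db.cancel_transaction()
        raise
    return stamp


def read_timestamp(db):
    """Return the stored timestamp, or None if none has been recorded."""
    value = db.get_metadata(LAST_MODIFIED_KEY)
    return int(value) if value else None
```

The cost of this approach is that every writer has to go through the wrapper; any code path that modifies the database directly would leave the stored time stale.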

Change History (9)

comment:1 by Olly Betts, 16 years ago

Not sure I buy the status information use really - I kind of feel you're scratching around for a second use there!

But a way to determine if a previous query might return different results now would be handy. Not just for the HTTP if-modified-since, but for general caching of results - e.g. you could cache rendered output from Omega along with the timestamp. Hmm, actually just being able to read the current revision would be enough for that, but that doesn't seem to provide an easy way to implement the HTTP if-modified-since case. But using a timestamp is a bit brittle if the HTTP server and Xapian server are on different machines as there's potential for clock skew.
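The general caching idea can be sketched like this. `ResultCache` is a hypothetical class; the "stamp" could be either a revision number or a timestamp, since the cache only ever compares it for equality.

```python
class ResultCache:
    """Cache rendered output, valid only while the database stamp is unchanged."""

    def __init__(self):
        self._entries = {}

    def get(self, query, current_stamp):
        """Return cached output for query, or None if missing or stale."""
        entry = self._entries.get(query)
        if entry is None:
            return None
        stamp, output = entry
        # Compare for equality rather than ordering: this works for a
        # revision number as well as a timestamp, and does not depend on
        # the clock of the machine holding the cache.
        if stamp != current_stamp:
            del self._entries[query]
            return None
        return output

    def put(self, query, current_stamp, output):
        """Store output rendered while the database was at current_stamp."""
        self._entries[query] = (current_stamp, output)
```

This covers general result caching but, as noted above, a bare revision number does not help with If-Modified-Since, which needs an actual time.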

Actually, isn't the timestamp of the newer base file of the record table a reliable timestamp, even in the face of failed updates? It won't be quite right after xapian-compact as record isn't updated last there, but it will just be a bit too old. The major advantage of this approach is that it will work for existing databases, both flint and chert.

comment:2 by Richard Boulton, 16 years ago

Actually, the status information request came from Tom, who wanted to be able to display it for a set of databases which he'd been working on - if I understood him correctly, there were several of them, with similar data but indexed in different ways, and displaying the last modification time would have been a useful check on which one was which. There are probably other, better, ways of making that clear (database uuids, for starters), but I didn't make the request up. ;-)

The timestamp of the newer base file might work normally - though it would change when databases were replicated.

Having an accurate timestamp would also be useful for HEAD requests - a search server could handle a HEAD request for a search without doing the search (and include a last modified time header in the response).

comment:3 by Olly Betts, 15 years ago

I think we want to avoid putting timestamp information into the database itself.

For the actual case here (checking manually which database was created when) ls -lt /PATH/TO/DB|head works well.

For HTTP "If-Modified-Since" and similar cases, I think the best option is a method which (in a backend-specific way) gets a suitable timestamp from the database files, or else returns (time_t)-1. Copying or replication may touch that timestamp, but I don't think that's much of an issue for the uses we're anticipating (and replication could easily replicate the file timestamps too).

The big plus of this approach is it works for all existing disk-based databases, and could be implemented if someone wrote a backend for a "foreign" format (say, a Lucene backend).
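A minimal sketch of such a method, written as a standalone function for illustration. The function name, the `backend` parameter and the lookup table are assumptions; only the glass file name `iamglass` is taken from the ticket description.

```python
import os

UNKNOWN = -1  # Mirrors the (time_t)-1 "unknown" value suggested above.

# Assumption: the file whose mtime best reflects the last commit, per backend.
# Only the glass entry comes from the ticket description.
COMMIT_FILES = {"glass": "iamglass"}


def last_modified(db_path, backend="glass"):
    """Return the last commit time in epoch seconds, or UNKNOWN."""
    name = COMMIT_FILES.get(backend)
    if name is None:
        return UNKNOWN  # Backend with no suitable file, e.g. in-memory.
    try:
        return int(os.stat(os.path.join(db_path, name)).st_mtime)
    except OSError:
        return UNKNOWN  # Missing or unreadable file.
```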

comment:4 by Dan, 14 years ago

I'm going to chime in and point out that even if we are replicating, the changesets will also be created with monotonically increasing timestamps. So even if they are not the same as the main server, they should be in order and thus be of use for testing if a database has been modified since a given date. I think if we do start to store timestamps we'll have to deal with locality issues to get it right.

comment:5 by Olly Betts, 11 years ago

Milestone: 1.3.x

comment:6 by Olly Betts, 10 years ago

Milestone: 1.3.x → 1.4.x

Reviewing this, I'm still leaning towards file timestamps as the best approach overall. If the limitations prove to be a problem, we could always switch to storing an explicit timestamp later.

I think this probably should work on a Database object (rather than just being some static method taking a path).

I'm not entirely sure what the returned type should be. The obvious choices are time_t (1 second resolution), struct timeval (1 µs resolution), struct timespec (1 ns resolution) or double (which can hold any integer up to 2^53 exactly, but seems a little problematic). POSIX and C11 specify struct timespec, which makes it a promising option. It would mean pulling in an external system header though.
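To make the resolution trade-off concrete, here is a small sketch that converts one nanosecond timestamp into each candidate representation. The helper names are invented for this example, and Python tuples stand in for the C structs.

```python
NS_PER_SEC = 10**9


def as_time_t(ns):
    """Whole seconds only, like time_t."""
    return ns // NS_PER_SEC


def as_timeval(ns):
    """(seconds, microseconds), like struct timeval."""
    return (ns // NS_PER_SEC, (ns % NS_PER_SEC) // 1000)


def as_timespec(ns):
    """(seconds, nanoseconds), like struct timespec."""
    return (ns // NS_PER_SEC, ns % NS_PER_SEC)


def as_double(ns):
    """Floating-point seconds, like a C double.

    A double has a 53-bit significand, so at present-day epoch values it
    resolves a few hundred nanoseconds and cannot round-trip every
    nanosecond timestamp.
    """
    return ns / 1e9


example_ns = 1_700_000_000_123_456_789
seconds = as_time_t(example_ns)
tv = as_timeval(example_ns)
ts = as_timespec(example_ns)
approx = as_double(example_ns)
```

Only the timespec form preserves the full value here; the others drop digits, which matters only if callers need to compare timestamps at sub-second granularity.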

Getting the timestamp:

  • For chert, we can look at the base file we used to open the record table.
  • For glass, we can just use the timestamp of the version file.
  • For inmemory, we'd probably just return unknown.
  • For remote, this would need a protocol tweak (but could return unknown for now).
  • For multidatabases, we'd return max() for the subdatabases (or unknown if any are unknown).
  • For writable databases, I'm not really sure what's best. The options I can see are:
    • Define that this is the last time a commit happened.
    • Update a timestamp on each modification (which is an extra overhead to maintain data we will probably never use for a case where the timestamp probably isn't very important anyway).
    • Return unknown.
    • Return the current time.
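The multidatabase rule from the list above can be sketched as follows, assuming "unknown" is represented as -1 in the style of (time_t)-1. The function name is hypothetical.

```python
UNKNOWN = -1  # Stand-in for (time_t)-1.


def combined_last_modified(sub_timestamps):
    """Combine per-subdatabase timestamps for a multidatabase.

    Returns the newest timestamp, or UNKNOWN if any subdatabase is
    unknown (or if there are no subdatabases at all).
    """
    stamps = list(sub_timestamps)
    # A plain max() would silently ignore -1, so check for it explicitly.
    if not stamps or any(s == UNKNOWN for s in stamps):
        return UNKNOWN
    return max(stamps)
```

Propagating unknown is the conservative choice: a caller that sees a definite time can rely on no subdatabase having changed since then.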

I don't see this as a blocker for 1.4.0, as it doesn't need incompatible changes, and nobody's asked about it since the initial request.

comment:7 by Olly Betts, 7 years ago

Component: Backend-Chert → Backend-Glass
Description: modified (diff)
Keywords: GoodFirstBug added

comment:8 by Vaibhav Kansagara, 6 years ago

Cc: vaibhavkansagara249@… added
Owner: changed from Olly Betts to Vaibhav Kansagara
Status: new → assigned

comment:9 by Olly Betts, 21 months ago

Milestone: 1.4.x → 2.0.0
Version: SVN trunk → git master