Opened 16 years ago

Last modified 12 months ago

#266 assigned enhancement

Add more control over the automatic flush and commit behaviour

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: Library API Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Currently, WritableDatabase::flush(), (which can only be called outside an explicit transaction), writes all the buffered changes to disk, and then performs a commit of the implicit transaction, making the changes visible to newly opened Database objects. This isn't really implied by the "flush" name - it would be more accurate to call the method "commit()", or similar. I suggest renaming flush to commit (but keeping flush() as an alias for backwards compatibility, for now).

Also, it would be helpful to be able to force the buffered changes to be written to disk, to reduce memory usage, but not to cause a commit(). This kind of flush would be useful when performing a large bulk change which the user wants to appear atomic. We can't call this method "flush()" because this would be a confusing change of behaviour. Olly suggests calling the method "minimise_memory", which seems fine to me, and describes the intention of the method, rather than its effects. This method would work the same whether called inside or outside an explicit transaction - it would write as many of the buffered changes as possible to disk, but not perform a commit.

Let's think about this for 1.1.0, though I'm not going to complain if this is bumped to later.

Attachments (2)

minmem.patch (6.6 KB ) - added by Richard Boulton 13 years ago.
Patch implementing this (but with insufficient tests)
set_flush_threshold.patch (22.1 KB ) - added by Richard Boulton 13 years ago.
Patch which adds set_flush_threshold() and set_commit_threshold() methods.

Download all attachments as: .zip

Change History (21)

comment:1 by Richard Boulton, 16 years ago

Blocking: 267 added

comment:2 by Olly Betts, 15 years ago

There's no patch yet, and we could make this change in 1.1.x, but it's not a big job, and I feel it would be better done for 1.1.0 if we're going to do it.

comment:3 by Olly Betts, 15 years ago

Blocking: 267 removed

(In #267) OK, let's go for that then. Seems I needs to unset blocked-by first.

comment:4 by Olly Betts, 15 years ago

Milestone: 1.1.01.1.1
Status: newassigned
Summary: Rename flush() to commit, add a pure minimise_memory() method.Add a minimise_memory() method.

I've added WritableDatabase::commit() as a new preferred alias for WritableDatabase::flush(), wrapped it in the bindings, and changed Omega and the bindings to use it.

Adding a new minimise_memory() (or some other name) method to flush but not commit is an API addition, so can wait for 1.1.x.

comment:5 by Olly Betts, 15 years ago

Milestone: 1.1.11.1.7

comment:6 by Olly Betts, 15 years ago

Priority: normallow

Could be bumped, but probably an easy change so leaving for 1.1.x for now.

comment:7 by Olly Betts, 15 years ago

Milestone: 1.1.71.2.0

API addition, bumping to keep on track for release.

by Richard Boulton, 13 years ago

Attachment: minmem.patch added

Patch implementing this (but with insufficient tests)

comment:8 by Richard Boulton, 13 years ago

Attached a patch, though it needs considerably more testing (I just added a couple of tests that it didn't break transaction semantics). Not sure how to test that it's functioning correctly.

comment:9 by Olly Betts, 13 years ago

Milestone: 1.2.x1.3.0
Priority: lownormal

I think the more important thing to test would be that the flushed() changes aren't committed - i.e. if you open the same database as a Xapian::Database, the document count doesn't increase. That's an obvious thing which could go wrong.

As I suggested on IRC, the easiest way to check it's actually doing what it should is to look at the size of postlist.DB to check it gets larger (at least for a suitably sized amount of data being flushed). Not terribly clean, but I don't think there's a better alternative.

I'm leaning towards thinking we should just call this method flush() - while it's a little confusing that the method name was previous used for a slightly different operation, we did add commit() as the preferred alias in 1.1.0 which is a whole unstable and stable release series ago, and if someone misses an update from flush() to commit(), then they'll still get the "write out changes" part, the changes just won't go live. It's not like calling flush() is a dangerous operation to accidentally invoke.

Otherwise, I'm not sure I like minimise_memory() so much - flush_changes() or force_flush() seem better options. Having flush in the name seems good, since the functionality is related to XAPIAN_FLUSH_THRESHOLD (though that also commits the changes when called outside a transaction).

Marking for 1.3.0.

comment:10 by Olly Betts, 13 years ago

An idea I mentioned on IRC (and Richard seemed to like) was to add a set_flush_threshold() method instead. If the threshold is reduced to below the currently batched size, we would then force a flush. Perhaps setting the threshold to 0 should not "stick", but otherwise we could return the old threshold allowing:

db.set_flush_threshold(db.set_flush_threshold(0));

Threshold would be in documents for now, but later we could add an optional second parameter to specify what it's in:

db.set_flush_threshold(1, db.GB);

There's also the issue that we currently flush and commit when $XAPIAN_FLUSH_THRESHOLD is reached, so perhaps we want a flag in there to specify if it's flush or commit, or perhaps a set_commit_threshold() method which controls that. You might want to flush every 10000 but commit every 100000 I guess. Or flush every 1GB but commit every 10000 documents.

comment:11 by Richard Boulton, 13 years ago

Summary: Add a minimise_memory() method.Add more control over the automatic flush and commit behaviour

After further discussion on IRC, there seems to be agreement that making a special "non-sticky" behaviour for 0 would be surprising, and might not be a good idea. Olly suggests that disabling implicit flushing is a bad idea, since certain operations cause implicit flushing anyway, and other tables still write directly to disk, so it doesn't control IO. I'm not convinced this is a real problem, as long as these limitations are documented.

I think there's a strong case for separating the commit and flush threshold - in RestPose, I currently force a commit after 5 seconds of inactivity, and otherwise commit using the automatic threshold. I'm quite happy to leave the flush threshold at an value which causes an automatic commit (though it is certainly good to be able to control that using a method, rather than having to call putenv, and then open the database again). However, when doing a batch update, I'd like to be able to avoid the overhead of the fsync() calls associated with commit until after the end of the batch update, so I would either set the threshold to a huge value, or to 0 if that disables automatic commits.

I've put together a patch which implements a set_flush_threshold() and set_commit_threshold() method, but again run out of time to properly test it. There's considerably more to test this time, too. The patch adds a separate counter for the number of changes since the last flush and since the last commit (replacing the single changes_count which previously counted both).

My main concern with the interface as implemented in the patch is how we handle units. While we could add an optional second parameter to specify units, there's no way to return the previous value of those units. We can still make the set_flush_threshold(set_flush_threshold(0)) idiom work sensibly if the unit-less version of the call leaves the units unchanged.

by Richard Boulton, 13 years ago

Attachment: set_flush_threshold.patch added

Patch which adds set_flush_threshold() and set_commit_threshold() methods.

comment:12 by James Aylett, 13 years ago

It might be cleaner to provide a typedef for "threshold", which can initially be numeric but later a struct that includes units (I'm thinking a little like fpos_t in fgetpos/fsetpos). We'd probably also want to overload set_FOO_threshold so there's one with separate formal parameters.

This preserves set_flush_threshold(set_flush_threshold(0)), allows for set_flush_threshold(int) to do the obvious thing, and provides a path to develop this further without more magical behaviour. Having set_flush_threshold(int) preserve the existing units is less obvious to me than having a default unit.

comment:13 by Olly Betts, 12 years ago

Milestone: 1.3.01.3.x

comment:14 by Olly Betts, 10 years ago

See also #195.

comment:15 by Olly Betts, 10 years ago

Milestone: 1.3.x1.3.3

comment:16 by Olly Betts, 9 years ago

Milestone: 1.3.31.3.4

comment:17 by Olly Betts, 8 years ago

Milestone: 1.3.41.4.x

I think we punt on this for 1.4.0. Pushing on with saner batch control is probably the best way to resolve this.

comment:18 by Olly Betts, 13 months ago

Milestone: 1.4.x1.5.0

[82569481a35b66635af393dfe8cf532d10710a01] finally removed flush() as a deprecated alias for commit(), which means we could reuse it to actually flush at some future point.

Control of the commit threshold via an API seems like it really should be possible (you can already set XAPIAN_FLUSH_THRESHOLD but that's clumsier than an API call, and getenv() has thread safety issues so it'd be better not to use it in libxapian).

Control of the real flush threshold (note XAPIAN_FLUSH_THRESHOLD really sets the commit threshold) by setting a number of changes seems the wrong approach though. Whether a change is in memory or on disk isn't something the API user should be worrying about. Some way to control the memory used seems more appropriate - how that relates to the number of changes varies widely depending on the size and nature of the changed documents.

Related, I have a patch which implements a "commit if we haven't for at least $TIME_INTERVAL and aren't in a transaction" feature, which maybe should be merged. The point I'm not entirely comfortable with is that it makes behaviour not solely dependent on inputs and so potentially debugging is harder. However, time-based is clearer useful (it gives at least a rough upper bound on how long it will be before a change is live in the search) and that seems to be an inherent property.

We should probably do at least some of this for 1.5.0.

comment:19 by Olly Betts, 12 months ago

I've closed #195 as a duplicate of this - it was specifically about an API to set the flush threshold, which is covered here.

Note: See TracTickets for help on using tickets.