Opened 16 years ago

Closed 15 years ago

#284 closed defect (fixed)

occasional DatabaseModifiedErrors

Reported by: Markus Woerle
Owned by: Olly Betts
Priority: normal
Milestone: 1.0.14
Component: Search::Xapian
Version: 1.0.12
Severity: normal
Keywords:
Cc: mrks@…
Blocked By:
Blocking:
Operating System: All

Description

I use xapian-core-1.0.7 with the corresponding Perl bindings. I run a 1-writer/N-reader setup, and I reopen() the database handle before each query. Nevertheless I occasionally get DatabaseModifiedErrors.

This is what I found out so far:

  • The error occurs after explicitly flushing my most frequented index. It occurs less often if I do a sleep(1) after each explicit flush() before applying new changes (without flush) to the index, and so far it has never occurred with a sleep(4). This is my workaround.
  • I already set XAPIAN_FLUSH_THRESHOLD to a large value (100000).
  • I patched the xapian-core library to log all calls of FlintDatabase::set_revision_number() and the throw points of the DatabaseModifiedErrors, which showed that the exception gets thrown in FlintTable::set_overwritten().
  • I patched again to get the caller and found out that set_overwritten() got called by FlintTable::block_to_cursor(), which I patched again to expose the condition:
if (REVISION(p) > REVISION(C_[j + 1].p)) {
    fprintf(stderr, "set_overwritten: from block_to_cursor() %d > %d\n",
            REVISION(p), REVISION(C_[j + 1].p));
    set_overwritten();
    return;
}

and it turned out:

set_overwritten: from block_to_cursor() 10194 > 10192
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
(...)
set_overwritten: from block_to_cursor() 10195 > 10193
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
set_overwritten: from block_to_cursor() 10195 > 10193
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
(...)
set_overwritten: from block_to_cursor() 10199 > 10197
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
set_overwritten: from block_to_cursor() 10199 > 10197
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'

I originally tested this with xapian-1.0.6, but it also occurs in 1.0.7.

I run Xapian on Ubuntu Linux 8.04 (Hardy) with a 2.6.24-19-server kernel and an ext3 filesystem. The machine is an IBM x3650 with 40 GB RAM and a ServeRAID-8k controller running a RAID 10 over 6 SAS disks.

My most frequented index (the one that throws the exceptions) contains about 850,000 documents, needs 11 GB of disk space, gets 5-15 updates per second, and serves about 20-25 searches per second. I flush() this index every 10 minutes (which takes about 60-100 seconds plus the 4-second workaround delay ;-)

Attachments (2)

reader.py (486 bytes ) - added by Richard Boulton 15 years ago.
Re-implementation of reader.pl in python
writer.py (763 bytes ) - added by Richard Boulton 15 years ago.
Re-implementation of writer.pl in python


Change History (24)

comment:1 by Olly Betts, 16 years ago

Milestone: 1.0.8
Status: new → assigned
Version: SVN HEAD → 1.0.7

I wonder if the issue could be that writes are being reordered aggressively by the disk subsystem such that we pick up some inconsistent intermediate state.

Perhaps we need to flush something like this to avoid such issues:

  • fdatasync() (or equivalent)
  • write new root block
  • fdatasync() (or equivalent)
  • update basefiles

Currently we do the above but without the first fdatasync().

I'll put together a patch for you to test which does this, or perhaps something which shows the timings if I can think of a good way to do that...

comment:2 by Richard Boulton, 16 years ago

The extra fdatasync() is worth a try, though I can't quite see how that can be the explanation, because the new root block shouldn't be being accessed until the new basefile has been written (or at least, not by a reader open at the old revision).

In the error printfs, I notice that the version numbers are 2 apart (e.g. "10194 > 10192"). So, this means a reader open at rev 10192 is finding that a block it wanted to read is now part of revision 10194. This looks very much as if the reopen() has somehow failed to work. Do we know what the reader is doing when the error occurs? One possibility that occurred to me is that the reopen() call itself is throwing the error - perhaps it's trying to read a block as part of the cleanup of the old revision or something (though I haven't found anything in the code to reflect that).

comment:3 by Markus Woerle, 16 years ago

Do we know what the reader is doing when the error occurs?

The exception occurs during my (perl-API) get_mset() call.

One possibility that occurred to me is that the reopen() call itself is throwing the error.

Not in my code anyway. My (perl-API) reopen() returns before the exception gets thrown.

I wrapped my calls and logged as follows:

(6223) reopen() begin
(6223) reopen() end
(6223) get_mset() begin
(6223) get_mset() end
(6223) reopen() begin
(6223) reopen() end
(6223) get_mset() begin
(6223) Xapian::DatabaseModifiedError: in set_overwritten    # <- this line comes from my patched xapian lib
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'

(The number in brackets is the PID of the process, to be sure that nothing gets mixed up.)

comment:4 by Olly Betts, 16 years ago

Milestone: 1.0.8 → 1.0.9
Summary: occasional !DatabaseModifiedErrors → occasional DatabaseModifiedErrors

Bumping to 1.0.9. We need to get to the bottom of this, but we also need to get 1.0.8 out.

comment:5 by Olly Betts, 15 years ago

Just looking at bugs marked for 1.0.9. I suspect we'll have to bump this one again for now, but I've just been trying to break it down a bit...

It seems to me that one of the following is probably true:

  • Our strategy for moving a database to a new revision is flawed.
  • We've incorrectly implemented a valid strategy.
  • The OS (or possibly disk hardware) doesn't correctly implement some guarantee (of POSIX or some other relevant standard) which our strategy relies on.
  • The compiler miscompiled the code (the last-refuge excuse of the desperate bug-chaser!)

So, does that help? It doesn't magically solve it, but perhaps we can narrow things down.

For example, can we reproduce this on another system? (That would make it easier to poke with a debugger too!)

Does a different GCC version make any difference if it's easy to rebuild with one?

comment:6 by Markus Woerle, 15 years ago

  • The OS (or possibly disk hardware) doesn't correctly implement some guarantee

I have no idea how to test this. If you do, let me know. In addition to the information above: I use the deadline scheduler (the default in Ubuntu server kernels) and have the write cache enabled in my ServeRAID-8k controller (as it is battery backed, this should not make any difference, even in the case of a power outage).

  • The compiler miscompiled the code (the last-refuge excuse of the desperate bug-chaser!)

I just compiled it with gcc/g++ 4.1 instead of 4.2 - with no effect.

For example, can we reproduce this on another system? (That would make it easier to poke with a debugger too!)

I tried to reproduce it with some generated indices of different sizes, with no success so far. It's hard to simulate the actual load which the live system gets. I am also not able to move the whole setup to another physical machine without risk and downtime.

I'll keep trying.

comment:7 by Olly Betts, 15 years ago

I'm not sure how to check for OS/disk hardware bugs either. I did notice this recently:

http://google-opensource.blogspot.com/2008/08/linux-disk-scheduler-benchmarking.html

which mentions that they found and fixed bugs in the deadline scheduler while benchmarking.

It might be interesting to see if you can reproduce the problem with other schedulers or with the hardware write cache off, if it's possible to test that without disrupting the live system.

Good to have ruled out miscompilation (well, probably - it could be miscompiled by both versions, but I don't think this is a likely explanation anyway).

I'm going to skip this for 1.0.9, since we don't yet have a grasp of what the cause is, so the only other option is to delay 1.0.9 until we do, which doesn't help anyone.

comment:8 by Olly Betts, 15 years ago

Milestone: 1.0.9 → 1.0.10

comment:9 by Olly Betts, 15 years ago

Milestone: 1.0.10 → 1.0.11

comment:10 by Olly Betts, 15 years ago

Milestone: 1.0.11 → 1.1.1

I think that given we've not really got a handle on the cause, and there's an easy workaround (sleeping for 4 seconds after flush()), I'm going to stop optimistically setting the milestone to 1.0.x and move it to 1.1.1 (the current bucket for "stuff in the 1.1.x series after 1.1.0").

If you upgrade to a newer Xapian release, or a newer Ubuntu version, please do let us know if the problem still occurs.

in reply to:  10 comment:11 by Markus Woerle, 15 years ago

I recently refactored our indexing/searching topology to support distribution as well as replication, but unfortunately, after that, our sleep(4) workaround did not work reliably anymore. I decided to use the new possibilities to replicate our profile index to another machine with completely different (disk/RAID) hardware, but the issue occurred again. I spent a few days digging again, and finally I managed to reproduce the bug with two little Perl scripts that are independent of our software and data!

These scripts are: http://5nord.org/~mrks/xapian/writer.pl and http://5nord.org/~mrks/xapian/reader.pl

Steps to reproduce:

1) First run writer.pl - it will build and populate an index in /tmp/test_index with 500,000 documents, each with one random 10-character value ([A-Z]) for sorting and two random terms XFOO... and XBAR... . In addition, writer.pl manually flush()es every 50,000 documents. Wait until writer.pl has finished this task, because the bug is more likely to occur while documents are being replaced in an existing index.

2) Run reader.pl - it will search the index by (endlessly) querying "XFOOone AND XBARtwo" and sorting by value_then_relevance on value 0. I suggest running at least three instances at the same time to improve the chance of the bug occurring. It will also occur with a single instance, but you might have to wait a longer period of time.

3) At the same time, run writer.pl again, which rebuilds the index by replacing documents.

The reader.pl processes will occasionally terminate with "terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'" and SIGABRT while writer.pl is running. I also noticed an increased chance of an abort at the moment writer.pl finishes and exits.

This is reproducible for me on different machines with different architectures: it occurs on our live systems (e.g. an IBM x3650 with Ubuntu Hardy) as well as on my MacBook Pro with Mac OS X 10.5.6.

I do not know if these scripts are exactly minimal to reproduce the bug, and I actually have no idea what's happening there, but I assume it might have something to do with the sorting by value, because if you leave that out, it doesn't occur.
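For reference, a rough Python sketch of the same reproduction, reconstructed from the steps above (the attached reader.py/writer.py are the canonical re-implementations; details guessed here, such as the WORDS list, may differ from the real scripts):

    import random
    import string
    import xapian

    DB_PATH = "/tmp/test_index"
    WORDS = ["one", "two", "three", "four"]  # guessed term suffixes

    def writer():
        db = xapian.WritableDatabase(DB_PATH, xapian.DB_CREATE_OR_OPEN)
        for i in range(500000):
            doc = xapian.Document()
            # Random 10-character [A-Z] value in slot 0, used for sorting.
            doc.add_value(0, "".join(random.choice(string.ascii_uppercase)
                                     for _ in range(10)))
            doc.add_term("XFOO" + random.choice(WORDS))
            doc.add_term("XBAR" + random.choice(WORDS))
            idterm = "Q" + str(i)
            doc.add_term(idterm)
            # replace_document() means a second run rewrites existing blocks.
            db.replace_document(idterm, doc)
            if (i + 1) % 50000 == 0:
                db.flush()  # flush() in 1.0.x; commit() in later releases
        db.flush()

    def reader():
        db = xapian.Database(DB_PATH)
        query = xapian.Query(xapian.Query.OP_AND, ["XFOOone", "XBARtwo"])
        while True:  # endless query loop, reopening before each search
            db.reopen()
            enquire = xapian.Enquire(db)
            enquire.set_query(query)
            enquire.set_sort_by_value_then_relevance(0, False)
            enquire.get_mset(0, 10)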

comment:12 by Olly Betts, 15 years ago

Version: 1.0.7 → 1.0.11

Excellent - this reproduces for me with 3 readers, using the 1.0 branch as we approach 1.0.12.

I'll try it on trunk next.

by Richard Boulton, 15 years ago

Attachment: reader.py added

Re-implementation of reader.pl in python

by Richard Boulton, 15 years ago

Attachment: writer.py added

Re-implementation of writer.pl in python

comment:13 by Richard Boulton, 15 years ago

I've attached a reimplementation of this in python (since I didn't have the perl bindings compiled), which failed for me as described when tested against trunk.

comment:14 by Olly Betts, 15 years ago

Version: 1.0.11 → SVN trunk

And this fails on trunk too.

Adding the values isn't required, but sorting by the (not added) values seems to be - without that the searches are quite a bit quicker, so that may be why...

comment:15 by Richard Boulton, 15 years ago

I believe that our algorithm is at fault. Here's my reasoning:

Suppose we start at revision 7. All readers are open at revision 7.

The writer, as soon as any modifications are made, writes these modified blocks with a revision number of 8. This is fine, since only blocks which are not used in revision 7 are modified. When the commit happens, the writer writes all modified blocks as revision 8.

At this point, any new readers will open revision 8. However, old readers (even if they called reopen() just before the commit), will still be on revision 7.

After the commit, any further modifications will be written as revision 9. If a revision 7 reader sees one of these, it will raise the error we've seen. However, the writer is now only guaranteeing not to modify blocks used in revision 8, so this is quite possible.

In other words, the algorithm only guarantees that reopen() will prevent DatabaseModifiedErrors from being raised if the writer holds off writing blocks to the database for at least as long as the longest time any reader spends reading from the database after calling reopen().
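To make that concrete, here is a toy model of the reasoning (illustrative Python only; ToyTable and its fields are invented, not Xapian code). It shows how a reader still at revision 7 can hit a revision-9 block - two revisions apart, matching the "10194 > 10192" pattern in the logs:

    class ToyTable:
        def __init__(self):
            self.committed = 7                 # latest committed revision
            self.block_rev = {0: 7, 1: 7}      # block number -> revision stamped on it
            self.live = {7: {0, 1}}            # revision -> blocks that revision uses

        def write_block(self, n):
            # The writer only promises not to touch blocks used by the
            # *latest* committed revision - older revisions are fair game.
            assert n not in self.live[self.committed]
            self.block_rev[n] = self.committed + 1

        def commit(self, blocks):
            self.committed += 1
            self.live = {self.committed: set(blocks)}

        def read_block(self, n, reader_rev):
            if self.block_rev[n] > reader_rev:
                raise RuntimeError("DatabaseModifiedError: block %d is rev %d, "
                                   "reader is at rev %d"
                                   % (n, self.block_rev[n], reader_rev))

    db = ToyTable()
    reader_rev = db.committed          # reader opens at revision 7
    db.write_block(2); db.commit({2})  # commit revision 8, using block 2 only
    db.write_block(0)                  # rev-9 change reuses block 0 (free in rev 8)
    db.read_block(0, reader_rev)       # reader at rev 7 -> DatabaseModifiedError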

comment:16 by Richard Boulton, 15 years ago

The obvious possible fixes are:

i) Have up to _three_ old revisions in use, rather than just 2. This would extend the period of time until reopen() is needed to the intended length of "the time between commits". However, it would increase database bloat, since the number of blocks which cannot be reused between commits would increase.

ii) Implementing read locks and ensuring that we never overwrite a block used by a revision which is locked. This would certainly be cool, but is lots of work.

Any others?

comment:17 by Olly Betts, 15 years ago

Well, there's "call reopen() when you get DatabaseModifiedError", which is what we recommend doing, but that isn't currently possible in Perl in 1.0.x. 1.1.0 translates C++ exceptions to Perl exceptions (at least in the cases which have been implemented), and we could add a simple patch to handle at least DatabaseModifiedError in 1.0.x.

Exception handling in C++ is typically rather costly though, so it isn't ideal that this is almost an expected path.

Reader locks would also increase DB bloat by forcing older revisions to be kept. The advantage is that we don't need to keep them if no readers are active.

Not sure I'm keen to fork a child process for every Database opened though. We could perhaps just have a single child process which holds all locks (or all read-locks) of a process, but it's trickier to handle, and doesn't actually help in the case of a CGI process like Omega - there's still one extra fork over not having read-locks. I hate fcntl's semantics...

comment:18 by Olly Betts, 15 years ago

Milestone: 1.1.1 → 1.0.12
Operating System: Linux → All
Resolution: fixed
Status: assigned → closed
Version: SVN trunk → 1.0.11

In branches/1.0, r12387 addresses this issue by adding exception handling for the Enquire methods which need it (and one Database method which was missing it). With this change, user code in Perl can now use the technique we suggest for other languages - i.e. catch the DatabaseModified exception, call reopen(), and retry the operation.

In trunk, there's already exception handling for these methods.

So I think we can close this ticket at last.
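For anyone hitting this, a minimal sketch of that catch/reopen()/retry pattern (shown with the Python bindings for brevity; the Perl idiom after r12387 is analogous):

    import xapian

    def search_with_retry(db, query, retries=5):
        """Run a query, retrying after reopen() if the DB has moved on."""
        while True:
            try:
                enquire = xapian.Enquire(db)
                enquire.set_query(query)
                return enquire.get_mset(0, 10)
            except xapian.DatabaseModifiedError:
                if retries <= 0:
                    raise        # give up after too many retries
                retries -= 1
                db.reopen()      # pick up the latest committed revision

    # Example usage, with the test index from the reproduction above:
    # db = xapian.Database("/tmp/test_index")
    # mset = search_with_retry(db, xapian.Query("XFOOone"))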

in reply to:  18 comment:19 by Markus Woerle, 15 years ago

Resolution: fixed
Status: closed → reopened

I think there is at least one try/catch block missing (I still occasionally get uncaught DatabaseModifiedErrors), namely around Document::get_data() in XS/Document.xs:

string
Document::get_data()
    CODE:
    try {
        RETVAL = THIS->get_data();
    } catch (const Error &error) {
        croak( "Exception: %s", error.get_msg().c_str() );
    }
    OUTPUT:
        RETVAL

Could you please include it in the next version?

comment:20 by Olly Betts, 15 years ago

Component: Backend-Flint → Search::Xapian
Milestone: 1.0.12 → 1.1.1

The actual handling is different, but this isn't handled in trunk either, so setting milestone:1.1.1 initially.

comment:21 by Olly Betts, 15 years ago

Milestone: 1.1.1 → 1.0.14
Version: 1.0.11 → 1.0.12

Fixed in trunk r12857 - needs to be fixed in 1.0 branch too.

comment:22 by Olly Betts, 15 years ago

Resolution: fixed
Status: reopened → closed

Backported for 1.0.14 in r13013.
