Opened 17 years ago

Closed 17 years ago

Last modified 17 years ago

#109 closed defect (released)

Flint fails to open a database if some base files from one revision are missing

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone:
Component: Backend-Flint Version: SVN trunk
Severity: normal Keywords:
Cc: Richard Boulton, Charlie Hull Blocked By:
Blocking: Operating System: All

Description

Son-of bug#108

This python script from Richard reproduces the problem (works on Linux at least):

http://www.xapian.org/cgi-bin/bugzilla/attachment.cgi?id=35&action=view

Richard wrote: "The script [...] exhibits a [...] problem with Xapian's transaction code - in that it fails to recover from a failed transaction, and ends up corrupting the database (ie, it results in a database which can't be opened because it has no consistent set of base files). Obviously, this could happen in other circumstances than the windows specific situation this bug report is about, so this needs further investigation and fixing - once the python script has been converted to a normal test case, it should be reasonably easy to fix. As an experiment, I tried changing the "sys_unlink()" call in flint_table.cc to "sys_unlink_if_exists()", and this seems to fix the problem (in that the transaction which the python script simulates failure for is rolled back, and future transactions proceed as normal). However, I'd not want to apply that change without getting a proper understanding of what is going on: obviously, we need to cope with transactions failing in all sorts of ways."

It's quite likely that this problem affects quartz too.

Attachments (3)

flint-missing-base-test.patch (1.7 KB ) - added by Olly Betts 17 years ago.
Port of reproducing example to C++
flint-maybe-fix-base-recovery.patch (486 bytes ) - added by Olly Betts 17 years ago.
Possible fix
quartz-recovery.patch (2.8 KB ) - added by Olly Betts 17 years ago.
Fix for flint ported back to quartz

Download all attachments as: .zip

Change History (13)

comment:1 by Olly Betts, 17 years ago

I've had a look at the code, and have a theory - I think that the problem could be that we only set both_bases to true when opening a table. We don't set it to false in this case, so we could end up a true setting hanging around.

Richard: could you try the patch? I tried using your script, but I get complaints about a double free still (with or without this patch). I'll try to work out what's going on there...

by Olly Betts, 17 years ago

Port of reproducing example to C++

comment:2 by Olly Betts, 17 years ago

Status: newassigned

I think I see - your test deletes the baseB file while the table is open for writing. So the "both_bases" flag is out-of-date compared to what exists on disk.

So either the test doesn't simulate exactly what can actually happen, or we stop complaining if the delete fails because there's no such file.

I'll commit the "possible fix" change anyway, since it's an improvement on the current code, even if only because it's cleaner.

comment:3 by Olly Betts, 17 years ago

Note there's a typo in the C++ reproducer - "term_list" should be "termlist".

I've done further analysis - as the code is now:

  • The "both_bases" variable should be set correctly when the table is opened.
  • We only try to delete the old base file if "both_bases" is true. We set

"both_bases" to false if deletion succeeds. If it fails, an exception is thrown, and "both_bases" remains true. So we can only get "stuck" in the case where we actually delete the base, but "unlink" tells us it failed, which I don't think you're suggesting as happening (it's conceivable it might happen over NFS but I've not fully pursued that line of thought).

In bug#108, Richard said:

"Further, the database can then be left in an state in which flint cannot open the database for searching or indexing. We observed this happening in a database in which the record.baseB and termlist.baseB files had been removed. I'll file a bug about this specific issue when I have more information."

If you meant that even restarting the program didn't help, I can't reproduce this (with or without the "possible fix" patch). I wonder if the "baseA" revision was also broken in your case, perhaps as a consequence of the "double free" bug corrupting the heap.

But if this is going wrong, I think it must be in flint_database.cc, where we try to find a consistent revision to open.

Thoughts?

comment:4 by Richard Boulton, 17 years ago

Okay - I've finally got round to investigating this. With your C++ reproducing code, and current SVN head, I get:

./runtest ./apitest foo1 -v -b flint ... Running test: foo1...2 2

DatabaseErrorDatabaseError exception: Error seeking to block: Invalid argument

as you describe. However, the database has then become corrupted: if I then try to open it with delve, I get:

../examples/delve foo1/ Error opening database `foo1/': Cannot open tables at consistent revisions

Will investigate further, but there's definitely still a problem here.

comment:5 by Olly Betts, 17 years ago

I've just reworked the C++ reproducer to work for flint and quartz and committed it as test "crashrecovery1". It passes for me (but it didn't fail before either). With your recent commit, is this now resolved for Flint?

Is Quartz affected too? I've ported your flint fix to quartz, but I can't reproduce it for quartz either...

by Olly Betts, 17 years ago

Attachment: quartz-recovery.patch added

Possible fix

comment:6 by Olly Betts, 17 years ago

attachments.isobsolete: 01

comment:7 by Richard Boulton, 17 years ago

My recent commit only fixes the occasional reporting of DatabaseOpeningError when trying to open a database during heavy modification load. The issue of corruption of databases is still present.

Your crashrecovery1 test passes for me. However, this is different from the foo1 test you originally wrote, in that it reopens the WritableDatabase after deleting the baseB files before making further modifications. If I don't reopen the database here, the test fails, in the same way as foo1 was failing for me.

I believe that it is valid to continue using the open writable database, because my reading of the code leads me to believe that this is exactly what happens in the case that unlinking of one of the baseB files fails, as it used to on windows. But I could be wrong about that.

Quartz is affected too - in fact, with quartz a segfault happens. Your port of my patch to flint looks to be useful though, so I'll test whether it fixes the database opening problem with my loadtest.cc test, and commit it if so.

comment:8 by Olly Betts, 17 years ago

As I said in comment#4:

I think I see - your test deletes the baseB file while the table is open for writing. So the "both_bases" flag is out-of-date compared to what exists on disk.

So either the test doesn't simulate exactly what can actually happen, or we stop complaining if the delete fails because there's no such file.

I don't think the python script actually models what used to happen on Windows. The python script deletes base files behind Xapian's back, so that the "both_bases" flag is incorrect compared to what is on disk. However, on Windows the deletion would fail for one of the tables, but the tables where the deletion had succeeded would have correctly set "both_bases" to false, while the one which failed and those not yet processed would correctly have "both_bases" still set to true.

I don't believe we need to gracefully handle people randomly deleting our files underneath us! It can't be another Xapian writer deleting it, unless the locking mechanism fails.

The test I committed is intended to model the situation where an update is interrupted (by the process dying or the machine crashing) - I interpreted your original remarks as meaning it was possible for a database to get "stuck" such that it couldn't be reopened for reading or writing, but perhaps I've misunderstood...

comment:9 by Richard Boulton, 17 years ago

Your comments make sense - in that case, this bug is probably fixed.

Your patch to fix the problem in quartz should certainly be useful though - I'll check that it works as expected with loadtest, and then commit it.

comment:10 by Olly Betts, 17 years ago

Resolution: fixed
Status: assignedclosed

I think we can now close this bug then.

comment:11 by Olly Betts, 17 years ago

Operating System: All
Resolution: fixedreleased

This bug will be fixed in 0.9.10 (coming soon).

Note: See TracTickets for help on using tickets.