#109 closed defect (released)
Flint fails to open a database if some base files from one revision are missing
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Backend-Flint | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Richard Boulton, Charlie Hull | Blocked By: | |
Blocking: | Operating System: | All |
Description
Son-of bug#108
This python script from Richard reproduces the problem (works on Linux at least):
http://www.xapian.org/cgi-bin/bugzilla/attachment.cgi?id=35&action=view
Richard wrote: "The script [...] exhibits a [...] problem with Xapian's transaction code - in that it fails to recover from a failed transaction, and ends up corrupting the database (ie, it results in a database which can't be opened because it has no consistent set of base files). Obviously, this could happen in other circumstances than the windows specific situation this bug report is about, so this needs further investigation and fixing - once the python script has been converted to a normal test case, it should be reasonably easy to fix. As an experiment, I tried changing the "sys_unlink()" call in flint_table.cc to "sys_unlink_if_exists()", and this seems to fix the problem (in that the transaction which the python script simulates failure for is rolled back, and future transactions proceed as normal). However, I'd not want to apply that change without getting a proper understanding of what is going on: obviously, we need to cope with transactions failing in all sorts of ways."
It's quite likely that this problem affects quartz too.
Attachments (3)
Change History (13)
comment:1 by , 18 years ago
by , 18 years ago
Attachment: | flint-missing-base-test.patch added |
---|
Port of reproducing example to C++
comment:2 by , 18 years ago
Status: | new → assigned |
---|
I think I see - your test deletes the baseB file while the table is open for writing. So the "both_bases" flag is out-of-date compared to what exists on disk.
So either the test doesn't simulate exactly what can actually happen, or we stop complaining if the delete fails because there's no such file.
I'll commit the "possible fix" change anyway, since it's an improvement on the current code, even if only because it's cleaner.
comment:3 by , 18 years ago
Note there's a typo in the C++ reproducer - "term_list" should be "termlist".
I've done further analysis - as the code is now:
- The "both_bases" variable should be set correctly when the table is opened.
- We only try to delete the old base file if "both_bases" is true. We set
"both_bases" to false if deletion succeeds. If it fails, an exception is thrown, and "both_bases" remains true. So we can only get "stuck" in the case where we actually delete the base, but "unlink" tells us it failed, which I don't think you're suggesting as happening (it's conceivable it might happen over NFS but I've not fully pursued that line of thought).
In bug#108, Richard said:
"Further, the database can then be left in an state in which flint cannot open the database for searching or indexing. We observed this happening in a database in which the record.baseB and termlist.baseB files had been removed. I'll file a bug about this specific issue when I have more information."
If you meant that even restarting the program didn't help, I can't reproduce this (with or without the "possible fix" patch). I wonder if the "baseA" revision was also broken in your case, perhaps as a consequence of the "double free" bug corrupting the heap.
But if this is going wrong, I think it must be in flint_database.cc, where we try to find a consistent revision to open.
Thoughts?
comment:4 by , 18 years ago
Okay - I've finally got round to investigating this. With your C++ reproducing code, and current SVN head, I get:
./runtest ./apitest foo1 -v -b flint ... Running test: foo1...2 2
DatabaseErrorDatabaseError exception: Error seeking to block: Invalid argument
as you describe. However, the database has then become corrupted: if I then try to open it with delve, I get:
../examples/delve foo1/ Error opening database `foo1/': Cannot open tables at consistent revisions
Will investigate further, but there's definitely still a problem here.
comment:5 by , 18 years ago
I've just reworked the C++ reproducer to work for flint and quartz and committed it as test "crashrecovery1". It passes for me (but it didn't fail before either). With your recent commit, is this now resolved for Flint?
Is Quartz affected too? I've ported your flint fix to quartz, but I can't reproduce it for quartz either...
comment:6 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:7 by , 18 years ago
My recent commit only fixes the occasional reporting of DatabaseOpeningError when trying to open a database during heavy modification load. The issue of corruption of databases is still present.
Your crashrecovery1 test passes for me. However, this is different from the foo1 test you originally wrote, in that it reopens the WritableDatabase after deleting the baseB files before making further modifications. If I don't reopen the database here, the test fails, in the same way as foo1 was failing for me.
I believe that it is valid to continue using the open writable database, because my reading of the code leads me to believe that this is exactly what happens in the case that unlinking of one of the baseB files fails, as it used to on windows. But I could be wrong about that.
Quartz is affected too - in fact, with quartz a segfault happens. Your port of my patch to flint looks to be useful though, so I'll test whether it fixes the database opening problem with my loadtest.cc test, and commit it if so.
comment:8 by , 18 years ago
As I said in comment#4:
I think I see - your test deletes the baseB file while the table is open for writing. So the "both_bases" flag is out-of-date compared to what exists on disk.
So either the test doesn't simulate exactly what can actually happen, or we stop complaining if the delete fails because there's no such file.
I don't think the python script actually models what used to happen on Windows. The python script deletes base files behind Xapian's back, so that the "both_bases" flag is incorrect compared to what is on disk. However, on Windows the deletion would fail for one of the tables, but the tables where the deletion had succeeded would have correctly set "both_bases" to false, while the one which failed and those not yet processed would correctly have "both_bases" still set to true.
I don't believe we need to gracefully handle people randomly deleting our files underneath us! It can't be another Xapian writer deleting it, unless the locking mechanism fails.
The test I committed is intended to model the situation where an update is interrupted (by the process dying or the machine crashing) - I interpreted your original remarks as meaning it was possible for a database to get "stuck" such that it couldn't be reopened for reading or writing, but perhaps I've misunderstood...
comment:9 by , 18 years ago
Your comments make sense - in that case, this bug is probably fixed.
Your patch to fix the problem in quartz should certainly be useful though - I'll check that it works as expected with loadtest, and then commit it.
comment:10 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I think we can now close this bug then.
comment:11 by , 18 years ago
Operating System: | → All |
---|---|
Resolution: | fixed → released |
This bug will be fixed in 0.9.10 (coming soon).
I've had a look at the code, and have a theory - I think that the problem could be that we only set both_bases to true when opening a table. We don't set it to false in this case, so we could end up a true setting hanging around.
Richard: could you try the patch? I tried using your script, but I get complaints about a double free still (with or without this patch). I'll try to work out what's going on there...