Opened 16 years ago
Closed 16 years ago
#317 closed defect (fixed)
Database corruption after disk-full error
Reported by: | Richard Boulton | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | normal | Milestone: | 1.0.10 |
Component: | Backend-Flint | Version: | 1.0.7 |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
After reports of corruption at a customer site when the disk filled up, I've been testing how Xapian behaves in this situation by indexing into a database on a small partition.
The key seems to be that, if a WritableDatabase is re-used after an operation on it has encountered an IOError, all sorts of corruption are possible. I've got a Python script which repeatably produces a corrupt database when run in a small partition, which I'll attach here shortly. However, the exact mode of failure is very sensitive to the initial amount of space available.
I've only tested this with the flint backend so far, and only with Xapian 1.0.7 (the version in Ubuntu Hardy), but it's likely that chert and more recent Xapian versions have a similar problem.
Attachments (2)
Change History (8)
by , 16 years ago
Attachment: | xaptest.py added |
---|
comment:1 by , 16 years ago
OK, so if I run the attached xaptest.py script with SVN trunk in an ext3 partition with slightly over 100MB of free space (102754KB, to be precise), I get 6000 documents successfully indexed, and then get errors due to running out of space, as follows (end of log shown; before this it just counts up to 6600):
    6610 6620 6630 6640 6650 6660 6670 6680 6690
    Error writing to file (No space left on device)
    Error writing to file (No space left on device)
    Traceback (most recent call last):
      File "/home/richard/xaptest.py", line 17, in <module>
        db.add_document(doc)
    xapian.DatabaseError: Error writing to file (No space left on device)
The process terminates at this point, and the database is left in a consistent state.
However, if I then re-run the xaptest.py script with the same database, I get:
    0
    Modifications failed (DatabaseError: Error writing to file (No space left on device)), and cannot set consistent table revision numbers: Couldn't reread base A
    10 20 30 40 [...] 650 660 670 680 690
    Couldn't reread base A
    Traceback (most recent call last):
      File "/home/richard/xaptest.py", line 17, in <module>
        db.add_document(doc)
    xapian.DatabaseCorruptError: Couldn't reread base A
    Segmentation fault
(Where [...] was the counts from 50 to 640)
The database is left in a corrupt state: here's a listing of the database directory:
    total 102778
    -rw-r--r-- 1 richard richard 0 2008-12-18 01:23 flintlock
    -rw-r--r-- 1 richard richard 12 2008-12-18 01:22 iamflint
    -rw-r--r-- 1 richard richard 951 2008-12-18 01:22 position.baseB
    -rw-r--r-- 1 richard richard 68071424 2008-12-18 01:23 position.DB
    -rw-r--r-- 1 richard richard 35815424 2008-12-18 01:23 postlist.DB
    -rw-r--r-- 1 richard richard 0 2008-12-18 01:23 postlist.tmp
    -rw-r--r-- 1 richard richard 16 2008-12-18 01:22 record.baseB
    -rw-r--r-- 1 richard richard 98304 2008-12-18 01:23 record.DB
    -rw-r--r-- 1 richard richard 25 2008-12-18 01:22 termlist.baseB
    -rw-r--r-- 1 richard richard 745472 2008-12-18 01:23 termlist.DB
    -rw-r--r-- 1 richard richard 16 2008-12-18 01:22 uuid
    -rw-r--r-- 1 richard richard 16 2008-12-18 01:22 value.baseB
    -rw-r--r-- 1 richard richard 98304 2008-12-18 01:23 value.DB
... we can see that there is no base file for the postlist table.
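A quick way to spot this condition is to check that every table with a `.DB` file also has at least one base file. The following is a minimal sketch that assumes only the file naming visible in the listing (`<table>.DB`, `<table>.baseA`, `<table>.baseB`); the helper names are made up for illustration and aren't part of Xapian, and the check says nothing about whether a base file that does exist is valid.

```python
import os


def tables_missing_base(filenames):
    """Return the sorted names of tables that have a .DB file but no base file."""
    names = set(filenames)
    # A table is identified by its "<table>.DB" file.
    tables = sorted(n[:-len(".DB")] for n in names if n.endswith(".DB"))
    # Keep only the tables with neither a baseA nor a baseB file.
    return [t for t in tables
            if t + ".baseA" not in names and t + ".baseB" not in names]


def check_database_dir(path):
    """Apply the same check to a database directory on disk."""
    return tables_missing_base(os.listdir(path))
```

Run against the file names in the listing, this reports only the postlist table.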
This corruption doesn't happen if I change the xaptest.py script so that it aborts after an error.
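The attached script isn't reproduced in this ticket, so the following is only a sketch of the "abort after an error" pattern, under assumed names. `index_documents` and `main` are hypothetical; `db` stands for any object with `add_document()` and `flush()` methods, as `xapian.WritableDatabase` has. The point is that the first exception ends the run and the writer is never touched again.

```python
import sys


def index_documents(db, docs, flush_every=10):
    """Add docs to db, flushing periodically. Any exception propagates."""
    added = 0
    for doc in docs:
        db.add_document(doc)      # may raise, e.g. when the disk is full
        added += 1
        if added % flush_every == 0:
            db.flush()            # may raise as well
    db.flush()
    return added


def main(db, docs):
    """Index docs, giving up on the first error instead of carrying on."""
    try:
        index_documents(db, docs)
    except Exception as e:
        # Deliberately no retry: db is not used again after a failure.
        print("Indexing failed, giving up: %s" % e, file=sys.stderr)
        return 1
    return 0
```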
comment:2 by , 16 years ago
Applying fix1.patch also changes the behaviour of the test run described, such that it doesn't result in a corrupt database. Instead, it produces the output:
    0
    Modifications failed (DatabaseError: Error writing to file (No space left on device)), and cannot set consistent table revision numbers: Couldn't reread base A
    Couldn't reread base A
    Traceback (most recent call last):
      File "/home/richard/xaptest.py", line 17, in <module>
        db.add_document(doc)
    xapian.DatabaseCorruptError: Couldn't reread base A
And leaves the database listing as:
    total 102770
    -rw-r--r-- 1 richard richard 0 2008-12-18 01:40 flintlock
    -rw-r--r-- 1 richard richard 12 2008-12-18 01:38 iamflint
    -rw-r--r-- 1 richard richard 951 2008-12-18 01:38 position.baseB
    -rw-r--r-- 1 richard richard 68063232 2008-12-18 01:40 position.DB
    -rw-r--r-- 1 richard richard 564 2008-12-18 01:38 postlist.baseB
    -rw-r--r-- 1 richard richard 35815424 2008-12-18 01:40 postlist.DB
    -rw-r--r-- 1 richard richard 0 2008-12-18 01:40 postlist.tmp
    -rw-r--r-- 1 richard richard 0 2008-12-18 01:39 record.baseA
    -rw-r--r-- 1 richard richard 16 2008-12-18 01:38 record.baseB
    -rw-r--r-- 1 richard richard 98304 2008-12-18 01:40 record.DB
    -rw-r--r-- 1 richard richard 25 2008-12-18 01:38 termlist.baseB
    -rw-r--r-- 1 richard richard 745472 2008-12-18 01:40 termlist.DB
    -rw-r--r-- 1 richard richard 16 2008-12-18 01:38 uuid
    -rw-r--r-- 1 richard richard 16 2008-12-18 01:38 value.baseB
    -rw-r--r-- 1 richard richard 98304 2008-12-18 01:40 value.DB
The idea of this patch is that, by reopening all the tables, we get the database back into a consistent state, thus avoiding the risk of corruption. I'm not sure if this is fully safe, but it looks promising to me. Comments welcome...
comment:3 by , 16 years ago
Description: | modified (diff) |
---|
Hmm, what is actually the first system call which fails due to the full disk in this scenario?
Judging from the patch, it seems it is cancel() which must be throwing, but that only seems to read from disk so I'm not sure why it would fail in this situation...
The fix does indeed look promising, though I'm not totally sure it's exactly the right (or at least best) approach (perhaps we should be more fine-grained about where the exception happened), and also it seems that if we throw in the new code, we're probably no better off...
comment:4 by , 16 years ago
Here comes another long comment...
In the second run (ie, the one which ends in a segmentation fault), the first exception raised is (according to a run under gdb with "catch throw" set):
    #0 0xb7a13e05 in __cxa_throw () from /usr/lib/libstdc++.so.6
    #1 0xb7b0bf02 in flint_io_write (fd=12, p=0x81cbd5c "\b\005\200@\003\002�\004�7\223\"", n=563) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_io.cc:57
    #2 0xb7af5c51 in FlintTable_base::write_to_file (this=0x81de684, filename=@0xbfd82b30, base_letter=65 'A', tablename=@0xbfd82b28, changes_fd=-1, changes_tail=0x0) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_btreebase.cc:333
    #3 0xb7b23a89 in FlintTable::commit (this=0x81de658, revision=8, changes_fd=-1, changes_tail=0x0) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_table.cc:1790
    #4 0xb7b00ce2 in FlintDatabase::set_revision_number (this=0x81de630, new_revision=8) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:500
    #5 0xb7b02000 in FlintDatabase::apply (this=0x81de630) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:786
    #6 0xb7b03956 in FlintWritableDatabase::flush (this=0x81de630) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:1305
    #7 0xb7a809a6 in Xapian::WritableDatabase::flush (this=0x81f9f50) at /home/richard/private/Working/xapian/working/xapian-core/api/omdatabase.cc:687
    #8 0xb7c39bd7 in _wrap_WritableDatabase_flush (args=0xb7da814c) at modern/xapian_wrap.cc:26857
    #9 0x0805cb97 in PyObject_Call ()
    #10 0x080c7aa7 in PyEval_EvalFrameEx ()
    #11 0x080cb1f7 in PyEval_EvalCodeEx ()
    #12 0x080cb347 in PyEval_EvalCode ()
    #13 0x080ea818 in PyRun_FileExFlags ()
    #14 0x080eaab9 in PyRun_SimpleFileExFlags ()
    #15 0x08059335 in Py_Main ()
    #16 0x080587f2 in main ()
The second exception is:
    #0 0xb7a13e05 in __cxa_throw () from /usr/lib/libstdc++.so.6
    #1 0xb7b2348b in FlintTable::cancel (this=0x81de658) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_table.cc:1876
    #2 0xb7afa5e1 in FlintDatabase::cancel (this=0x81de630) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:800
    #3 0xb7b064fe in FlintWritableDatabase::cancel (this=0x81de630) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:1732
    #4 0xb7b01b8d in FlintDatabase::modifications_failed (this=0x81de630, old_revision=7, new_revision=8, msg=@0xbfd82c60) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:747
    #5 0xb7b020f0 in FlintDatabase::apply (this=0x81de630) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:788
    #6 0xb7b03956 in FlintWritableDatabase::flush (this=0x81de630) at /home/richard/private/Working/xapian/working/xapian-core/backends/flint/flint_database.cc:1305
    #7 0xb7a809a6 in Xapian::WritableDatabase::flush (this=0x81f9f50) at /home/richard/private/Working/xapian/working/xapian-core/api/omdatabase.cc:687
    #8 0xb7c39bd7 in _wrap_WritableDatabase_flush (args=0xb7da814c) at modern/xapian_wrap.cc:26857
    #9 0x0805cb97 in PyObject_Call ()
    #10 0x080c7aa7 in PyEval_EvalFrameEx ()
    #11 0x080cb1f7 in PyEval_EvalCodeEx ()
    #12 0x080cb347 in PyEval_EvalCode ()
    #13 0x080ea818 in PyRun_FileExFlags ()
    #14 0x080eaab9 in PyRun_SimpleFileExFlags ()
    #15 0x08059335 in Py_Main ()
    #16 0x080587f2 in main ()
This means that the second error is due to cancel being unable to read the alternate base file, which doesn't exist.
I think the problem is that FlintTable::commit() sets the "base_letter" member to point to the alternate base before failing (and also sets various properties of the FlintBase object), and doesn't tidy itself up on exception. Therefore, after an exception, the FlintTable object is left in an inconsistent state, such that cancel() fails.
I think a good fix might be to respond to a failure in commit() by closing and reopening the tables, to ensure that they're in a consistent state. This could probably be done most neatly by a try-catch-throw around most of FlintTable::commit(), which calls FlintTable::close() on any error.
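To make the idea concrete, here is a toy Python model of it. This is not the Flint code: `ToyTable`, its attributes and the `write_base` callback are all invented for illustration, and the real fix would be a C++ try/catch inside FlintTable::commit(). The model only shows the shape of the approach: state is switched to the alternate base before the write, and any failure closes the table so that the half-updated state can't be used later.

```python
class ToyTable:
    """Toy stand-in for a table with two alternating base files, 'A' and 'B'."""

    def __init__(self, write_base):
        self.write_base = write_base   # hypothetical callable(letter, revision)
        self.committed_letter = "B"    # models the base file that exists on disk
        self.revision = 7              # models the last committed revision
        self.open()

    def open(self):
        # Rebuild in-memory state from what is known to be committed.
        self.base_letter = self.committed_letter
        self.is_open = True

    def close(self):
        # Discard in-memory state; a later open() starts from a clean slate.
        self.base_letter = None
        self.is_open = False

    def commit(self, revision):
        try:
            # Switch to the alternate base *before* writing it, as described.
            self.base_letter = "A" if self.base_letter == "B" else "B"
            self.write_base(self.base_letter, revision)   # may raise
        except Exception:
            # Tidy up on any error rather than leaving base_letter pointing
            # at a base file which was never written, then re-raise.
            self.close()
            raise
        self.committed_letter = self.base_letter
        self.revision = revision
```

After a failed `commit()`, calling `open()` on the toy table puts it back on base B, which is the state a subsequent cancel would need.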
commit() is only called by FlintDatabase::set_revision_number(), which in turn is called in several places:
- FlintDatabase constructor. If a failure occurs here, the exception is just propagated, and construction fails, so we don't need to do any cleanup.
- FlintDatabase::apply(). If a failure occurs here, modifications_failed() is called, which calls cancel() and then open_tables() (so tables closed by failure of commit() would be reopened here).
- FlintDatabase::modifications_failed(). If a failure occurs here, we'd probably be best to respond by putting the database into a "hard-closed" state. Olly suggested adding an alternative to FlintTable::close() which sets the handle to -2, and treating that as a closed state from which we don't automatically reopen the tables (raising an exception instead). I think this would be a good response, in the circumstances.
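The "hard-closed" state from the last item could look roughly like the following toy Python model. Only the -2 sentinel comes from the suggestion above; the use of -1 for an ordinary closed handle, the class `ToyHandle`, the `opener` callback and the exception type are assumptions made for the sketch.

```python
FD_CLOSED = -1        # assumed: ordinary closed state, reopened on demand
FD_HARD_CLOSED = -2   # closed after an unrecoverable failure, never reopened


class ToyHandle:
    """Toy model of a table handle which can be soft- or hard-closed."""

    def __init__(self, opener):
        self.opener = opener   # hypothetical callable returning a descriptor
        self.handle = FD_CLOSED

    def close(self, permanent=False):
        self.handle = FD_HARD_CLOSED if permanent else FD_CLOSED

    def ensure_open(self):
        if self.handle == FD_HARD_CLOSED:
            # Refuse to reopen automatically; the caller must start afresh.
            raise RuntimeError("table was closed after an unrecoverable error")
        if self.handle == FD_CLOSED:
            self.handle = self.opener()
        return self.handle
```

The design choice is that a soft close remains transparent to callers, while a hard close turns every later operation into an explicit error instead of a silent reopen on possibly inconsistent files.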
comment:5 by , 16 years ago
Fixed in revision [11716] on trunk (for chert and flint), by the technique described at the end of the previous comment.
comment:6 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Backported to branches/1.0 as r11724.
Marking as fixed (I assume you didn't because this needed backporting - if not, reopen it!)
Test script