Read block errors after reopen()
|Reported by:||Jean-Francois Dockes||Owned by:||Olly Betts|
Because of an ancient glitch (index flushes triggered by a query call), the Recoll indexing process uses two separately opened Database objects during indexing: one for updating the index, and the other one, readonly, for querying (mostly up-to-date signature values).
The queries using the readonly Database object get DatabaseModified exceptions, and call reopen() before retrying. This works well in general.
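For reference, the retry logic described above can be sketched roughly as follows. This is not the actual Recoll code: `retry_on_modified`, `max_tries` and the `ModifiedError` stand-in type are hypothetical names chosen so that the sketch compiles without Xapian installed.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the library's "database modified" exception,
// used only so that this sketch builds without Xapian.
struct ModifiedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Run `query`. If it throws ModifiedErr, call `reopen` and try again,
// for at most `max_tries` attempts in total. Any other exception type
// propagates to the caller untouched.
template <class ModifiedErr, class Query, class Reopen>
auto retry_on_modified(Query query, Reopen reopen, int max_tries = 3)
    -> decltype(query())
{
    for (int attempt = 1; ; ++attempt) {
        try {
            return query();
        } catch (const ModifiedErr&) {
            if (attempt >= max_tries)
                throw;      // give up and let the caller see the error
            reopen();       // refresh the readonly handle, then retry
        }
    }
}
```

With Xapian, the exception type would presumably be `Xapian::DatabaseModifiedError` and the `reopen` callable would call `reopen()` on the readonly Database object. The errors reported in this ticket are of other exception types, so a wrapper like this would not catch them.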
However, there are very rare cases where queries happening after the reopen() get other Xapian exceptions, like:
Expected block 0 to be level 1, not 0
Error reading block xxx: got end of file
I have also seen the process stuck in an infinite loop somewhere in the following call stack (probably near the bottom, as I never get a shorter stack with Ctrl-C / continue inside gdb):
#0 __memcmp_ssse3 () at ../sysdeps/x86_64/multiarch/memcmp-ssse3.S:40
#1 0x00007fd81e3efbe0 in Key::operator<(Key) const () from /usr/lib/libxapian.so.22
#2 0x00007fd81e3efca8 in ChertTable::find_in_block(unsigned char const*, Key, bool, int) () from /usr/lib/libxapian.so.22
#3 0x00007fd81e3f0cc3 in ChertTable::find(Cursor*) const () from /usr/lib/libxapian.so.22
#4 0x00007fd81e3ccc69 in ChertCursor::find_entry(std::string const&) () from /usr/lib/libxapian.so.22
#5 0x00007fd81e3f7283 in ?? () from /usr/lib/libxapian.so.22
#6 0x00007fd81e3fb59b in ?? () from /usr/lib/libxapian.so.22
#7 0x00007fd81e3dc3fa in ?? () from /usr/lib/libxapian.so.22
#8 0x00007fd81e3538a6 in Xapian::Document::Internal::get_value(unsigned int) const () from /usr/lib/libxapian.so.22
#9 0x00007fd81e35390c in Xapian::Document::get_value(unsigned int) const () from /usr/lib/libxapian.so.22
#10 0x00007fd81f2f6bcb in Rcl::Db::needUpdate (this=0x1cee5f0, udi=..., sig=..., existed=existed@entry=0x7fff7edcddc8) at ../rcldb/rcldb.cpp:1762
...
This all happens while the recoll 1.19.13 indexer is running SINGLE-THREADED, and I could reproduce it with Xapian 1.2.8 and 1.2.16.
All known cases occurred on machines using SSDs, and the problem seems easier to reproduce with a relatively slow CPU. I tried quite hard to reproduce the issue on a spinning disk system, with no luck. This might indicate that timing is somehow relevant. Also, all cases were on Ubuntu, either 12.04 or 14.04.
The original reporting user, who can reproduce the issue quite frequently, uses a 2006 MacBook with ext4 on an SSD, and Ubuntu Trusty.
Changing the code so that the query db object is a copy of the update one instead of being separately opened makes the problem disappear, and I'll commit this change, as the reason for using two db objects has been gone for many years.
It is quite possible that the Recoll code is incorrect again. I have no simple program to reproduce the issue, and the single db object workaround is actually an improvement of the code, so I am creating this report more as a reference point than as a request for a fix.
Change History (11)
comment:1 by , 6 years ago
|Component:||Other → Backend-Chert|
|Status:||new → assigned|