Opened 10 years ago

Closed 9 years ago

#645 closed defect (fixed)

Read block errors after reopen()

Reported by: Jean-Francois Dockes
Owned by: Olly Betts
Priority: low
Milestone: 1.2.21
Component: Backend-Chert
Version: 1.2.16
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Because of an ancient glitch (index flushes triggered by a query call), the Recoll indexing process uses two separately opened Database objects during indexing: one for updating the index, and the other, read-only, for querying (mostly up-to-date signature values).

The queries using the readonly Database object get DatabaseModified exceptions, and call reopen() before retrying. This works well in general.

However, there are very rare cases where queries happening after the reopen() get other Xapian exceptions, like:

Expected block 0 to be level 1, not 0

Error reading block xxx: got end of file

I have also seen the process stuck in an infinite loop somewhere in the following call stack (probably near the bottom, as I never get a shorter stack when interrupting with Ctrl-C and continuing inside gdb).

#0  __memcmp_ssse3 () at ../sysdeps/x86_64/multiarch/memcmp-ssse3.S:40
#1  0x00007fd81e3efbe0 in Key::operator<(Key) const ()
   from /usr/lib/libxapian.so.22
#2  0x00007fd81e3efca8 in ChertTable::find_in_block(unsigned char const*, Key, bool, int) () from /usr/lib/libxapian.so.22
#3  0x00007fd81e3f0cc3 in ChertTable::find(Cursor*) const ()
   from /usr/lib/libxapian.so.22
#4  0x00007fd81e3ccc69 in ChertCursor::find_entry(std::string const&) ()
   from /usr/lib/libxapian.so.22
#5  0x00007fd81e3f7283 in ?? () from /usr/lib/libxapian.so.22
#6  0x00007fd81e3fb59b in ?? () from /usr/lib/libxapian.so.22
#7  0x00007fd81e3dc3fa in ?? () from /usr/lib/libxapian.so.22
#8  0x00007fd81e3538a6 in Xapian::Document::Internal::get_value(unsigned int) const () from /usr/lib/libxapian.so.22
#9  0x00007fd81e35390c in Xapian::Document::get_value(unsigned int) const ()
   from /usr/lib/libxapian.so.22
#10 0x00007fd81f2f6bcb in Rcl::Db::needUpdate (this=0x1cee5f0, udi=..., 
    sig=..., existed=existed@entry=0x7fff7edcddc8) at ../rcldb/rcldb.cpp:1762
...

This all happens while the recoll 1.19.13 indexer is running SINGLE-THREADED, and I could reproduce it with both Xapian 1.2.8 and 1.2.16.

It happens that all known cases occurred on machines using SSDs, and the problem seems easier to reproduce with a relatively slow CPU. I tried quite hard to reproduce the issue on a spinning-disk system, with no luck, which might indicate that timing is somehow relevant. Also, all cases were on Ubuntu, either 12.04 or 14.04.

The original reporting user, who can reproduce the issue quite frequently, uses a 2006 MacBook with ext4 on an SSD, running Ubuntu Trusty.

Changing the code so that the query db object is a copy of the update one, instead of being separately opened, makes the problem disappear, and I'll commit this change, since the reason for using two db objects disappeared many years ago.

It is quite possible that the Recoll code is at fault again. I have no simple program to reproduce the issue, and the single-db-object workaround is actually an improvement to the code, so I am creating this report more as a reference point than as a request for a fix.

Change History (11)

comment:1 by Olly Betts, 10 years ago

Component: Other → Backend-Chert
Milestone: 1.3.3
Status: new → assigned
Version: 1.2.16

FWIW, a query on a WritableDatabase can still cause data to be flushed to disk, but it's not nearly as brutal as it used to be.

With the chert backend, if you create an allterms iterator, all pending postlist changes are flushed to disk (but not committed). In query terms, that means a wildcard or partial query.

With the brass backend (at least currently), it will flush (but not commit) postlist data for any term used in the query. The thinking is that this means we can just use the standard code to iterate a postlist and avoid having special case code for iterating a postlist + pending modifications. For something like recoll with a low query rate and likely a small "working set" of terms in a given time window, that should work well.

Brass will also flush all doclength data if you iterate all docs - e.g. a pure NOT query - unless the docids are contiguous (in this special case we simply iterate from lowest to highest without reading the actual data from disk).

Brass is also smarter about flushing postlist changes for a wildcard, and will only flush the data for terms which match the wildcard (note that this data would get flushed when the query was actually run anyway).


Anyway, as for the actual issue, this clearly shouldn't happen. I'll try to reproduce it (I do have an SSD, but also a fast CPU).

I wonder if #640 might be related.

comment:2 by Jean-Francois Dockes, 10 years ago

After reading your info about flushes, I checked the code to see whether I was still using allterms_begin() (I am not). I guess that the actual calls involved may be of some interest if you try to reproduce the issue.

So the following pseudo-code is executed when testing if index data is up to date for the next indexing candidate document:

// Test whether the index data for a document is up to date
try {
    string uniterm = make_uniterm(myid); // Unique id term for doc
    Xapian::PostingIterator docid = xdb.postlist_begin(uniterm);
    Xapian::Document doc = xdb.get_document(*docid);
    string oldsig = doc.get_value(VALUE_SIG);
    // Do non-Xapian stuff with sig

    vector<Xapian::docid> children;
    try {
        string pterm = make_parentterm(myid); // Unique id term as parent
        children.insert(children.begin(), xdb.postlist_begin(pterm),
                        xdb.postlist_end(pterm));
        // Do non-Xapian stuff with children docids
    } catch (const Xapian::DatabaseModifiedError &) {
        // reopen and repeat
    }

} catch (const Xapian::DatabaseModifiedError &) {
    // reopen and repeat
}

There are two try blocks because the inner one is in a function which can also be called directly at other times. As far as I know, it is the outer exception which is caught in general.

After the reopen, the first error is usually (apparently not always) the one about the wrong level. Then I get the read errors, for every call of this routine, and always the same block, until the end of indexing (so what happens is that all following documents get reindexed). I don't have a log file at hand, unfortunately, and things are not perfectly repeatable, so this is all subject to a little doubt, but I could possibly restart the test loop if you need more info.

The backend is the current default one for 1.2 (Chert, I guess? An iamchert file exists in the Xapian directory).

comment:3 by Jean-Francois Dockes, 10 years ago

Also, about SSDs and CPU speeds: I can't be sure, of course, because I have very few samples. But the two reporting users are using Ubuntu Trusty laptops refitted with SSDs (and old Core2 or Core2 Duo CPUs), and I seem to reproduce the problem much more easily on my Trusty MacBook Pro (Core2 Duo + SSD) than on my Ubuntu Precise desktop machine (recent Core i7 + SSD). I could never trigger the issue on the server (Debian + Core i5 + spinning disk). The Xapian versions are the standard ones for the respective Ubuntu releases and for Wheezy.

Of course, having 5 samples is a little short to prove anything...

comment:4 by Olly Betts, 9 years ago

Milestone: 1.3.3 → 1.3.2

comment:5 by Olly Betts, 9 years ago

I suspect the "infinite loop" may be due to getting a negative key length, which gets interpreted as a huge unsigned value when passed to memcmp(). Not sure if that's a separate issue, or a side-effect of the same underlying issue.

comment:6 by Olly Betts, 9 years ago

Milestone: 1.3.2 → 1.3.3

I'm going to move this back to 1.3.3 again. It would be good to resolve it but I really don't want to delay 1.3.2 even longer.

comment:7 by Olly Betts, 9 years ago

I'm hopeful that the fix for #675 in [1900eca6d325e9185b9f9f5c75a4a4a545616d7f/git] might also solve this issue.

That commit fixes issues with reading values after reopen due to the value manager cursor having bad data when the table has gained a level, and while writing the regression test I even got an instance of a seemingly infinite loop.

Can you test, either with current master, or applying the chert changes from that commit to 1.2.x?

comment:8 by Olly Betts, 9 years ago

That fix has now been backported to the 1.2 branch, so using a snapshot from there is probably the simplest way to test this:

http://oligarchy.co.uk/xapian/branches/1.2/

comment:9 by Olly Betts, 9 years ago

Milestone: 1.3.3 → 1.3.4

I don't want to hold up 1.3.3 while we wait for confirmation.

comment:10 by Jean-Francois Dockes, 9 years ago

I tested the patch on 2 different systems:

  • Run the test loop with Xapian 1.2.16: the issue reproduces in a few minutes.
  • Install xapian-core-1.2.21_git129 (from branches/1.2), run for one hour: no problem.
  • Reinstall 1.2.16: the issue is back on the first try.

As far as I can see, the issue is fixed in xapian-core-1.2.21_git129.

comment:11 by Olly Betts, 9 years ago

Milestone: 1.3.4 → 1.2.21
Resolution: fixed
Status: assigned → closed

Thanks for confirming.

Note: See TracTickets for help on using tickets.