Opened 7 years ago

Closed 7 years ago

#752 closed defect (fixed)

Segmentation fault in matcher/queryoptimiser

Reported by: Robert Stepanek Owned by: Olly Betts
Priority: normal Milestone: 1.4.5
Component: Matcher Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

As posted on the xapian-devel mailing list on July 31, 2017:

Since a couple of weeks we are experiencing occasional segmentation faults within Xapian 1.5. We can't reproduce the crashes, but we have strong hints that they are due to memory corruption. We have narrowed down our root cause analysis to phrase searches on multi-databases that fail on reading the hint field in theQueryOptimiserclass [1].

We'd appreciate any hints on how to fix this. I've written up our findings and solution attempts below. Should we post this on trac?

Our findings so far

In a core dump we see that calling the open_nearby_postlist function on the hint variable [2] falls of a cliff, resulting in a segfault:

    (gdb) bt 2
    #0  0x000000000001eaa1 in ?? ()
    #1  0x00007fa19d09231f in LocalSubMatch::open_post_list
    (this=0x13527d0, term=..., wqf=1, factor=1,
        need_positions=<optimized out>, in_synonym=<optimized out>,
        qopt=0x7ffe66370940, lazy_weight=false)
        at matcher/localsubmatch.cc:289

the line at localsubmatch.cc:289 is

    pl = hint->open_nearby_postlist(term);

Unfortunately, the compiler had optimised a lot of debugging information away. Still, it's clear where the system crashed.

Following a similar crash and re-running the query we could not reproduce the crash, but valgrind catched a read-after-free on the hint field (full valgrind log attached):

    ==2265126== Invalid read of size 8
    ==2265126==    at 0x9A6B313:
    LocalSubMatch::open_post_list(std::string const&, unsigned int,
    double, bool, bool, QueryOptimiser*, bool) (localsubmatch.cc:289)

which got free'd in this codepath:

    ==2265126==  Address 0xb3fdfd0 is 0 bytes inside a block of size 216
    free'd
    ==2265126==    at 0x4C2A360: operator delete(void*) (in
    /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==2265126==    by 0x99C0B75: operator() (queryinternal.cc:65)
    ==2265126==    by 0x99C0B75:
    for_each<__gnu_cxx::__normal_iterator<Xapian::PostingIterator::Internal**,
    std::vector<Xapian::PostingIterator::Internal*> >,
    delete_ptr<Xapian::PostingIterator::Internal> > (stl_algo.h:3755)
    ==2265126==    by 0x99C0B75: Xapian::Internal::Context::reset()
    (queryinternal.cc:155)
    ==2265126==    by 0x99C481F:
    Xapian::Internal::QueryWindowed::postlist_windowed(Xapian::Query::op,
    Xapian::Internal::AndContext&, QueryOptimiser*, double) const
    (queryinternal.cc:1668)

which is triggered by this condition in queryinternal.cc#L1665 ([2]);

        if (!qopt->db.has_positions()) {
	// No positions in this subdatabase so this matches nothing,
	// which means the whole andcontext matches nothing.
	ctx.reset();
	return;
    }

How to fix this

Here is what I believe is happening:

We are using subdatabases (all glass) for caching recent database additions, and some of these do not contain positional information. This makes the internal query code reset the AND-context, which in effect frees its postlist. One of the postlist entries is still pointed at by the hint field of QueryOptimiser from a previous submatch, and the next call to get_hint_postlist in queryoptimiser.h#L106 references the bogus address.

A fix to avoid this is simple: just reset the QueryOptimiser hint field when free'ing the context postlist. I've written a one-liner for this here [3]. But I'm not yet convinced that's all there is: could the hint field be used already somewhere else? Should we probably keep it and make QueryOptimiser take ownership?

[1] https://github.com/xapian/xapian/blob/master/xapian-core/matcher/queryoptimiser.h#L51 [2] https://github.com/xapian/xapian/blob/master/xapian-core/api/queryinternal.cc#L1665 [3] https://github.com/rsto/xapian/commit/3e7d65b25eef00347f5c764af5ff4d770433ac9b [4] https://github.com/xapian/xapian/blob/master/xapian-core/matcher/queryoptimiser.h#L106

Change History (6)

comment:1 by Olly Betts, 7 years ago

Component: OtherMatcher
Description: modified (diff)
Milestone: 1.4.5
Status: newassigned
Version: git master

comment:2 by Olly Betts, 7 years ago

I am not deep enough into the query optimiser code to know which approach is better in term of performance.

It's not obvious to me either.

The saving when the hint is close could be quite substantial, and the bookkeeping overhead is probably small, so I'd guess it's better to not clear the hint, but I don't think I spotted it could just be NULL-ed out in these cases when I added the code to track ownership.

I'll try some tests.

If taking ownership is preferred I could make the context resetter aware which postlist member to skip during the free().

I'm thinking it would be good for reset() to either clear or hand over ownership of the hint, as that avoids a whole slew of potential similar issues. Your case is the second one, so empirically this seems a breeding ground for bugs - if we always leave the hint valid by default then we only need to be careful when removing some entries from Context.pls (and perhaps we can provide a helper for that too).

Ideally we should have a regression test for this, but I've been trying to create one based on existing testcase subdbwithoutpos1 in tests/api_query.cc and I haven't so far managed.

What does the query which causes the valgrind error look like? I'm interested in the entire Query object tree, not just a query string, in case the whole Query object isn't just built by parsing a query string in an obvious way. The exact terms probably don't matter, so feel free to genericise anything which might invade end user privacy.

comment:3 by Olly Betts, 7 years ago

Got it - you need something AND-ed with a phrase (so the AndContext has something in pls), but then you also need another term combined with a non-AND-like operator so that the hint is used, for exampel:

(this AND "this paragraph") OR wibble

comment:4 by Olly Betts, 7 years ago

I've compared the current approach of handing over ownership of the hint to the QueryOptimiser object with just setting the hint to NULL in situations where it might be invalidated.

The differences each way in my tests were very small, and similar in size but in opposite directions. As you'd expect, handing over ownership incurs a little extra overhead; but the gain in a small-scale test for the hint being set when it gets used was also small.

I think the conclusion is actually to keep the ownership mechanism - although it didn't give us a big gain in my tests, those were small scale tests and the gain with a larger database should be greater, while overhead for tracking ownership is essentially fixed. The key point is really that that we pay a small fixed cost which can save us nothing to lots.

comment:5 by Olly Betts, 7 years ago

Fixed in git master by [18f562bfbb32179e2db14d412aec20aee7408f30/git]. Needs backporting for 1.4.5. Was introduced in 1.4.3 so 1.2 branch is unaffected.

comment:6 by Olly Betts, 7 years ago

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.