Opened 7 years ago
Closed 7 years ago
#752 closed defect (fixed)
Segmentation fault in matcher/queryoptimiser
Reported by: | Robert Stepanek | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.4.5 |
Component: | Matcher | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
As posted on the xapian-devel mailing list on July 31, 2017:
Since a couple of weeks we are experiencing occasional segmentation
faults within Xapian 1.5. We can't reproduce the crashes, but we have
strong hints that they are due to memory corruption. We have narrowed
down our root cause analysis to phrase searches on multi-databases that
fail on reading the hint
field in theQueryOptimiser
class [1].
We'd appreciate any hints on how to fix this. I've written up our findings and solution attempts below. Should we post this on trac?
Our findings so far
In a core dump we see that calling the open_nearby_postlist
function
on the hint
variable [2] falls of a cliff, resulting in a segfault:
(gdb) bt 2 #0 0x000000000001eaa1 in ?? () #1 0x00007fa19d09231f in LocalSubMatch::open_post_list (this=0x13527d0, term=..., wqf=1, factor=1, need_positions=<optimized out>, in_synonym=<optimized out>, qopt=0x7ffe66370940, lazy_weight=false) at matcher/localsubmatch.cc:289
the line at localsubmatch.cc:289 is
pl = hint->open_nearby_postlist(term);
Unfortunately, the compiler had optimised a lot of debugging information away. Still, it's clear where the system crashed.
Following a similar crash and re-running the query we could not
reproduce the crash, but valgrind catched a read-after-free on the
hint
field (full valgrind log attached):
==2265126== Invalid read of size 8 ==2265126== at 0x9A6B313: LocalSubMatch::open_post_list(std::string const&, unsigned int, double, bool, bool, QueryOptimiser*, bool) (localsubmatch.cc:289)
which got free'd in this codepath:
==2265126== Address 0xb3fdfd0 is 0 bytes inside a block of size 216 free'd ==2265126== at 0x4C2A360: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==2265126== by 0x99C0B75: operator() (queryinternal.cc:65) ==2265126== by 0x99C0B75: for_each<__gnu_cxx::__normal_iterator<Xapian::PostingIterator::Internal**, std::vector<Xapian::PostingIterator::Internal*> >, delete_ptr<Xapian::PostingIterator::Internal> > (stl_algo.h:3755) ==2265126== by 0x99C0B75: Xapian::Internal::Context::reset() (queryinternal.cc:155) ==2265126== by 0x99C481F: Xapian::Internal::QueryWindowed::postlist_windowed(Xapian::Query::op, Xapian::Internal::AndContext&, QueryOptimiser*, double) const (queryinternal.cc:1668)
which is triggered by this condition in queryinternal.cc#L1665 ([2]);
if (!qopt->db.has_positions()) { // No positions in this subdatabase so this matches nothing, // which means the whole andcontext matches nothing. ctx.reset(); return; }
How to fix this
Here is what I believe is happening:
We are using subdatabases (all glass) for caching recent database
additions, and some of these do not contain positional information. This
makes the internal query code reset the AND-context, which in effect
frees its postlist. One of the postlist entries is still pointed at by
the hint
field of QueryOptimiser
from a previous submatch, and the
next call to get_hint_postlist
in queryoptimiser.h#L106 references the
bogus address.
A fix to avoid this is simple: just reset the QueryOptimiser
hint field
when free'ing the context postlist. I've written a one-liner for this
here [3]. But I'm not yet convinced that's all there is: could the hint
field be used already somewhere else? Should we probably keep it and
make QueryOptimiser
take ownership?
[1] https://github.com/xapian/xapian/blob/master/xapian-core/matcher/queryoptimiser.h#L51 [2] https://github.com/xapian/xapian/blob/master/xapian-core/api/queryinternal.cc#L1665 [3] https://github.com/rsto/xapian/commit/3e7d65b25eef00347f5c764af5ff4d770433ac9b [4] https://github.com/xapian/xapian/blob/master/xapian-core/matcher/queryoptimiser.h#L106
Change History (6)
comment:1 by , 7 years ago
Component: | Other → Matcher |
---|---|
Description: | modified (diff) |
Milestone: | → 1.4.5 |
Status: | new → assigned |
Version: | → git master |
comment:2 by , 7 years ago
comment:3 by , 7 years ago
Got it - you need something AND-ed with a phrase (so the AndContext
has something in pls
), but then you also need another term combined with a non-AND-like operator so that the hint is used, for exampel:
(this AND "this paragraph") OR wibble
comment:4 by , 7 years ago
I've compared the current approach of handing over ownership of the hint to the QueryOptimiser
object with just setting the hint to NULL
in situations where it might be invalidated.
The differences each way in my tests were very small, and similar in size but in opposite directions. As you'd expect, handing over ownership incurs a little extra overhead; but the gain in a small-scale test for the hint being set when it gets used was also small.
I think the conclusion is actually to keep the ownership mechanism - although it didn't give us a big gain in my tests, those were small scale tests and the gain with a larger database should be greater, while overhead for tracking ownership is essentially fixed. The key point is really that that we pay a small fixed cost which can save us nothing to lots.
comment:5 by , 7 years ago
Fixed in git master by [18f562bfbb32179e2db14d412aec20aee7408f30/git]. Needs backporting for 1.4.5. Was introduced in 1.4.3 so 1.2 branch is unaffected.
comment:6 by , 7 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Backported for 1.4.5 in [37ea4101dee3231a442827deea9e6abc621c74c9/git].
It's not obvious to me either.
The saving when the hint is close could be quite substantial, and the bookkeeping overhead is probably small, so I'd guess it's better to not clear the hint, but I don't think I spotted it could just be NULL-ed out in these cases when I added the code to track ownership.
I'll try some tests.
I'm thinking it would be good for reset() to either clear or hand over ownership of the hint, as that avoids a whole slew of potential similar issues. Your case is the second one, so empirically this seems a breeding ground for bugs - if we always leave the hint valid by default then we only need to be careful when removing some entries from Context.pls (and perhaps we can provide a helper for that too).
Ideally we should have a regression test for this, but I've been trying to create one based on existing testcase
subdbwithoutpos1
intests/api_query.cc
and I haven't so far managed.What does the query which causes the valgrind error look like? I'm interested in the entire Query object tree, not just a query string, in case the whole Query object isn't just built by parsing a query string in an obvious way. The exact terms probably don't matter, so feel free to genericise anything which might invade end user privacy.