Opened 3 weeks ago
Last modified 2 weeks ago
#838 new defect
Invalid write in EstimateOp::report_ratio
Reported by: | Robert Stepanek | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.5.0 |
Component: | Matcher | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
I am debugging a memory violation with latest 1.5 master when running Xapian::Enquire::get_mset for a specific query and database. I managed to reduce the query and database to the minimal set that's required to reproduce the memory violation, but now I'm stuck how to further debug and fix this.
The query looks like this (I replaced the actual terms with placeholders):
Query((term1@1 AND ((term2@1 PHRASE 2 term2@2) OR (term2@1 PHRASE 2 term2@2)) AND (<alldocuments> AND_NOT (<alldocuments> FILTER XEP))))
I can not share the database, but it's a glass database with the following characteristics:
$ xapian-delve-1.5 crasherdb UUID = fcdb211b-3bad-4583-8863-a99ef02a40fe number of documents = 2 average document length = 167.5 document length lower bound = 102 document length upper bound = 233 highest document id ever used = 3 has positional information = true revision = 1 currently open for writing = false
What happens for this query and this particular database is that valgrind reports an invalid write (full valgrind log is attached):
==29738== Invalid write of size 4 ==29738== at 0x4A417CC: report_ratio (estimateop.h:124) ==29738== by 0x4A417CC: SelectPostList::~SelectPostList() (selectpostlist.cc:61) ==29738== by 0x4A2ED93: ExactPhrasePostList::~ExactPhrasePostList() (exactphrasepostlist.cc:68) ==29738== by 0x4A4022F: ~OrPostList (orpostlist.h:71) ==29738== by 0x4A4022F: OrPostList::~OrPostList() (orpostlist.h:73) ==29738== by 0x4A2B67F: next_helper (andpostlist.h:75) ==29738== by 0x4A2B67F: AndPostList::next(double) (andpostlist.cc:124) ==29738== by 0x4A35DB3: PostListTree::next(double) (postlisttree.h:151) ==29738== by 0x4A32DC3: Matcher::get_local_mset(unsigned int, unsigned int, unsigned int, Xapian::Weight const&, Xapian::MatchDecider const*, Xapian::KeyMaker const*, unsigned int, unsigned int, int, double, double, Xapian::Enquire::docid_order, unsigned int, Xapian::Enquire::Internal::sort_setting, bool, double, std::vector<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy>, std::allocator<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy> > > const&) (matcher.cc:499)
which is because the written memory got freed previously
==29738== Address 0x52a248c is 12 bytes inside a block of size 40 free'd ==29738== at 0x4888360: operator delete(void*, unsigned long) (vg_replace_malloc.c:935) ==29738== by 0x493046B: pop_op (localsubmatch.h:112) ==29738== by 0x493046B: destroy_postlist (queryoptimiser.h:176) ==29738== by 0x493046B: shrink (queryinternal.cc:190) ==29738== by 0x493046B: Xapian::Internal::Context::~Context() (queryinternal.cc:153) ==29738== by 0x492B637: ~OrContext (queryinternal.cc:370) ==29738== by 0x492B637: operator() (unique_ptr.h:95) ==29738== by 0x492B637: operator() (unique_ptr.h:89) ==29738== by 0x492B637: reset (unique_ptr.h:203) ==29738== by 0x492B637: reset (unique_ptr.h:501) ==29738== by 0x492B637: Xapian::Internal::AndContext::postlist(Xapian::Internal::TermFreqs*) (queryinternal.cc:833) ==29738== by 0x492B8AF: Xapian::Internal::QueryAndLike::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2502) ==29738== by 0x4A30A1B: LocalSubMatch::get_postlist(PostListTree*, unsigned int*) (localsubmatch.cc:188) ==29738== by 0x4A32A8B: Matcher::get_local_mset(unsigned int, unsigned int, unsigned int, Xapian::Weight const&, Xapian::MatchDecider const*, Xapian::KeyMaker const*, unsigned int, unsigned int, int, double, double, Xapian::Enquire::docid_order, unsigned int, Xapian::Enquire::Internal::sort_setting, bool, double, std::vector<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy>, std::allocator<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy> > > const&) (matcher.cc:381)
and that data had been allocated before at
==29738== Block was alloc'd at ==29738== at 0x4885828: operator new(unsigned long) (vg_replace_malloc.c:422) ==29738== by 0x492B4B7: add_op<EstimateOp::op_type> (localsubmatch.h:101) ==29738== by 0x492B4B7: add_op<EstimateOp::op_type> (queryoptimiser.h:84) ==29738== by 0x492B4B7: postlist (queryinternal.cc:629) ==29738== by 0x492B4B7: Xapian::Internal::AndContext::postlist(Xapian::Internal::TermFreqs*) (queryinternal.cc:842) ==29738== by 0x492B8AF: Xapian::Internal::QueryAndLike::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2502) ==29738== by 0x492873F: Xapian::Query::Internal::postlist_sub_or_like(Xapian::Internal::OrContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*, bool) const (queryinternal.cc:1174) ==29738== by 0x492C307: Xapian::Internal::QueryBranch::do_or_like(Xapian::Internal::OrContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*, unsigned int, unsigned long, bool) const (queryinternal.cc:2256) ==29738== by 0x492C9AF: Xapian::Internal::QueryOr::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2619) ==29738== by 0x492A977: Xapian::Query::Internal::postlist_sub_and_like(Xapian::Internal::AndContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:1163) ==29738== by 0x492524F: Xapian::Internal::QueryAndLike::postlist_sub_and_like(Xapian::Internal::AndContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2515) ==29738== by 0x492B89B: Xapian::Internal::QueryAndLike::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2499) ==29738== by 0x4A30A1B: LocalSubMatch::get_postlist(PostListTree*, unsigned int*) (localsubmatch.cc:188) ==29738== by 0x4A32A8B: Matcher::get_local_mset(unsigned int, unsigned int, unsigned int, Xapian::Weight const&, Xapian::MatchDecider const*, Xapian::KeyMaker const*, unsigned int, unsigned int, int, double, double, Xapian::Enquire::docid_order, unsigned int, Xapian::Enquire::Internal::sort_setting, bool, double, std::vector<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy>, std::allocator<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy> > > const&) (matcher.cc:381)
The valgrind log suggests to me that the various rewrites of the EstimateOp stack do no propagate to all the places where pointers to those EstimateOps are held. To mitigate the invalid write, a shared_ptr might help, but I fear that would just hide the logical bug that's occurring here? I might try doing that though, to at least mitigate the crash until I could fix the actual root cause.
Since I can't share the database as-is, is there some way I can rewrite the terms in that database to obfuscate its contents but still replicate the crash so that I can share it? I tried trimming down the database to the minimal set using the WritableDatabase API, but it looks as if I can't mess with the postlists or term lists that way?
Attachments (4)
Change History (14)
by , 3 weeks ago
Attachment: | valgrind.txt added |
---|
comment:1 by , 3 weeks ago
I wonder if it's the hint
postlist that's to blame. That prolongs the lifetime of one postlist, so its destructor gets called later. The one thing that doesn't make sense for this as the explanation is that the hint should always be the postlist for a term, but the failure happens for a SelectPostList
.
Could you try with the attached patch?
If that fixes it, it would be a plausible short-term workaround but would need careful study as to whether it harms performance in the cases the hint
postlist mechanism is there to improve.
It'd be really good to have a reproducer we can put in the testsuite, and it seems to be only two documents so hopefully we can find a way. I suspect the exact circumstances matter, but it's likely to only be the ordering and occurrence patterns of the terms.
You can't edit postlists or termlists directly - the way to do it would be to scan the documents from the database and create a new database with new documents based on them. Like copydatabase but instead of adding the documents from the existing database to the new one, create new documents with a subset of the terms and add those instead.
The size of the database doesn't seem to be a problem though. If we had an "anonymiser" which replaced every term with an arbitrary one but preserving the collation order and with the same wdf and term positions, would you be able to share the database passed through that? I think it would be easy enough to write such a tool (and perhaps a useful one to have for future instances where a bug is reproducible but only with a database which can't be shared).
(E.g. we have 335 words over two documents - at worst they're all unique words, so if we map them to terms which are 3 digit numbers then e.g. a
-> 000
, aardvark
-> 001
, etc.)
comment:2 by , 3 weeks ago
Hmm, but if it was the hint then accepted
and rejected
should still be zero so we shouldn't dereference estimate_op
:
SelectPostList::~SelectPostList() { if (estimate_op && (accepted || rejected)) { // Only call report_ratio() if there are counts. During the building // of the PostList tree we sometimes need to delete PostList objects // and their associated EstimateOp and it's hard to arrange that they // are always deleted in the correct order. estimate_op->report_ratio(accepted, rejected); } }
The ownership tracking around this has probably become too complicated. shared_ptr
probably helps build something that is easier to reliably reason about object lifetimes in, but it feels like it's not actually fixing the underlying design of this to be cleaner (essentially hint postlists and then the EstimateOp
machinery got bolted on to an existing system).
Anyway, let's try to work out what's actually going on here.
comment:3 by , 3 weeks ago
Version: | → git master |
---|
Setting version - the EstimateOp
mechanism isn't in 1.4.x so bugs in it shouldn't be able to manifest there.
by , 3 weeks ago
Attachment: | ticket838.tar added |
---|
comment:4 by , 3 weeks ago
I could not reproduce the invalid write with obfuscated terms, but I managed to trim the database to the minimal set of terms required to make valgrind detect it.
This is all there is
$ xapian-delve-1.5 -a -1 -vv ticket838_db/ All terms in database (termfreq, collection freq, wdf upper bound): XEP 1 0 0 XFgroups 1 1 1 XSchange 1 1 1 XSmember 1 1 1
I attached a new file "ticket838.tar" to this ticket, which includes the database as well as a C++ file which reproduces the invalid write. Running it with valgrind produces the valgrind log I had submitted initially.
I did not find any patch to disable hints, so I did not try that.
comment:5 by , 3 weeks ago
Milestone: | → 1.5.0 |
---|
Oops, I failed to actually attach the patch yesterday - now there.
Thanks for the reduced testcase, I'll have a poke.
comment:6 by , 3 weeks ago
I do also get the valgrind errors.
The query can be simplified a little, e.g. this triggers it:
libtool --mode=execute valgrind examples/quest -d ticket838_db -s '' -p s:XS -b x:X 's:member AND (s:"member member" OR s:member) NOT x:EP'
Query object from that:
Parsed Query: Query(((XSmember@1 AND ((XSmember@2 PHRASE 2 XSmember@3) OR XSmember@4)) AND_NOT XEP))
(Using quest because it's less faff to try to minimise by edits of a command line than edit/compile/run of C++.)
So it seems the MatchAll
s aren't relevant. The phrase clearly is since the bad write is from SelectPostList
's destructor. I can't seem to remove anything else and still trigger it. Also XEP
does need to be a different term.
comment:7 by , 3 weeks ago
The patch to clear the hint doesn't seem to help.
Adding some prints to the code:
destroy_postlist(XEP:1) pop_op() popping type 6, now doing 2 subqueries: pop_op() popping type 0 pop_op() popping type 8, now doing 2 subqueries: pop_op() popping type 0 pop_op() popping type 4, now doing 1 subqueries: pop_op() popping type 6, now doing 2 subqueries: pop_op() popping type 0 pop_op() popping type 0
The op types are from:
enum op_type { KNOWN, DECIDER = 1, NEAR = 2, PHRASE = 3, EXACT_PHRASE = 4, POSTING_SOURCE, AND, AND_NOT, OR, XOR };
So it does pop_op of an AND
with 2 subqueries, which is clearly not the right thing at all. I'm amazed this doesn't go wrong more, but the testsuite is run under valgrind in CI and passes.
I think the trigger for this is that it can probably see that the right side of the AND_NOT
can be discarded because the left and right sides can't match at the same time (that can be spotted from the term range info but I haven't followed through that code yet to check that's actually how it gets to there).
This is probably a reasonable short-term workaround:
diff --git a/xapian-core/matcher/queryoptimiser.h b/xapian-core/matcher/queryoptimiser.h index 5ca302a00a0c..8f2bf591fe04 100644 --- a/xapian-core/matcher/queryoptimiser.h +++ b/xapian-core/matcher/queryoptimiser.h @@ -173,7 +173,7 @@ class QueryOptimiser { } // Remove the EstimateOp (and any sub-ops) which we generated for this // PostList. - localsubmatch.pop_op(); + //localsubmatch.pop_op(); } bool need_wdf_for_compound_weight() const {
We should resolve all the EstimateOp
s anyway, so having extra ones should be OK. Maybe that's actually the right fix, but I definitely want to look at this in more detail.
comment:8 by , 3 weeks ago
I think the trigger for this is that it can probably see that the right side of the AND_NOT can be discarded because the left and right sides can't match at the same time (that can be spotted from the term range info but I haven't followed through that code yet to check that's actually how it gets to there).
That seems right - we've created postlists (and EstimateOp
s) for both sides of the AND_NOT and before we create the AndNotPostList
(or its EstimateOp
) we can see the two sides are disjoint so we can discard the right side. We try to pop_op()
to remove the unwanted right hand side EstimateOp
, but the order of the ops in the estimate_stack
isn't right for doing that.
We should resolve all the
EstimateOp
s anyway, so having extra ones should be OK. Maybe that's actually the right fix, but I definitely want to look at this in more detail.
If we don't pop_op()
then we need to create an AND_NOT
EstimateOp
for the estimates to get resolved correctly.
The parsed query and stack (with the missing AND_NOT
added) would look like this (where [first, last]
is the known possible range of documents matched):
Parsed Query: Query(((XSmember@1 AND ((XSmember@2 PHRASE 2 XSmember@3) OR XSmember@4)) AND_NOT XEP)) AND_NOT x2 [1, 1] AND x2 [1, 1] KNOWN min=1 est=1 max=1 [2, 2] OR x2 KNOWN min=1 est=1 max=1 [1, 1] EXACT_PHRASE AND x2 [1, 1] KNOWN min=1 est=1 max=1 [1, 1] KNOWN min=1 est=1 max=1 [1, 1] KNOWN min=1 est=1 max=1 [1, 1]
KNOWN min=1 est=1 max=1 [2, 2]
must be XEP
so somehow the order is wrong - that and the AND
should be the other way around which would give:
AND_NOT x2 [1, 1] KNOWN min=1 est=1 max=1 [2, 2] AND x2 [1, 1] OR x2 KNOWN min=1 est=1 max=1 [1, 1] EXACT_PHRASE AND x2 [1, 1] KNOWN min=1 est=1 max=1 [1, 1] KNOWN min=1 est=1 max=1 [1, 1] KNOWN min=1 est=1 max=1 [1, 1]
And if that were the case, the existing code which calls pop_op()
would indeed remove the now unwanted EstimateOp
from the right side of the AND_NOT
.
comment:9 by , 3 weeks ago
I've found the cause of that problem - we need to create the AND_NOT
ExpandOp
before the AND
one. With candidate-fix.patch
applied, valgrind reports no errors.
I'm not entirely happy with the design here though - things need to be created in the right order and while that order mostly matches the natural order that things happen in, this case shows it's possible to get the order wrong without that standing out as wrong in the code.
I think given the underlying problem wasn't the object lifetimes (but rather that we were deleting the wrong objects) that in the short term I'll apply that fix (assuming it works for your full case), turn your reproducer into a regression test, and create testcases which fully exercise other places where the EstimateOp
handling is complex. The testsuite gets run under valgrind in CI (and by me locally) so that should catch any other similar problems.
I think the case that is wrong looks like the most tricky, because AndContext
pulls together all "connected" and-like operations and groups the associated "nots" and "maybes", so in the general case it needs to build a complex tree of different postlist types and so handle the corresponding EstimateOp
s too.
comment:10 by , 2 weeks ago
Thanks for this. I don't have anything to add to this analysis. I applied the candidate fix and did a bunch of local tests, all of which succeeded without invalid memory writes. I will now apply this patch to a broader set of real-world queries in a test environment.
Valgrind log