Opened 3 weeks ago

Last modified 2 weeks ago

#838 new defect

Invalid write in EstimateOp::report_ratio

Reported by: Robert Stepanek Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: Matcher Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

I am debugging a memory violation with latest 1.5 master when running Xapian::Enquire::get_mset for a specific query and database. I managed to reduce the query and database to the minimal set that's required to reproduce the memory violation, but now I'm stuck how to further debug and fix this.

The query looks like this (I replaced the actual terms with placeholders):

Query((term1@1 AND ((term2@1 PHRASE 2 term2@2) OR (term2@1 PHRASE 2 term2@2)) AND (<alldocuments> AND_NOT (<alldocuments> FILTER XEP))))

I can not share the database, but it's a glass database with the following characteristics:

$ xapian-delve-1.5 crasherdb
UUID = fcdb211b-3bad-4583-8863-a99ef02a40fe
number of documents = 2
average document length = 167.5
document length lower bound = 102
document length upper bound = 233
highest document id ever used = 3
has positional information = true
revision = 1
currently open for writing = false

What happens for this query and this particular database is that valgrind reports an invalid write (full valgrind log is attached):

==29738== Invalid write of size 4
==29738==    at 0x4A417CC: report_ratio (estimateop.h:124)
==29738==    by 0x4A417CC: SelectPostList::~SelectPostList() (selectpostlist.cc:61)
==29738==    by 0x4A2ED93: ExactPhrasePostList::~ExactPhrasePostList() (exactphrasepostlist.cc:68)
==29738==    by 0x4A4022F: ~OrPostList (orpostlist.h:71)
==29738==    by 0x4A4022F: OrPostList::~OrPostList() (orpostlist.h:73)
==29738==    by 0x4A2B67F: next_helper (andpostlist.h:75)
==29738==    by 0x4A2B67F: AndPostList::next(double) (andpostlist.cc:124)
==29738==    by 0x4A35DB3: PostListTree::next(double) (postlisttree.h:151)
==29738==    by 0x4A32DC3: Matcher::get_local_mset(unsigned int, unsigned int, unsigned int, Xapian::Weight const&, Xapian::MatchDecider const*, Xapian::KeyMaker const*, unsigned int, unsigned int, int, double, double, Xapian::Enquire::docid_order, unsigned int, Xapian::Enquire::Internal::sort_setting, bool, double, std::vector<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy>, std::allocator<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy> > > const&) (matcher.cc:499)

which is because the written memory got freed previously

==29738==  Address 0x52a248c is 12 bytes inside a block of size 40 free'd
==29738==    at 0x4888360: operator delete(void*, unsigned long) (vg_replace_malloc.c:935)
==29738==    by 0x493046B: pop_op (localsubmatch.h:112)
==29738==    by 0x493046B: destroy_postlist (queryoptimiser.h:176)
==29738==    by 0x493046B: shrink (queryinternal.cc:190)
==29738==    by 0x493046B: Xapian::Internal::Context::~Context() (queryinternal.cc:153)
==29738==    by 0x492B637: ~OrContext (queryinternal.cc:370)
==29738==    by 0x492B637: operator() (unique_ptr.h:95)
==29738==    by 0x492B637: operator() (unique_ptr.h:89)
==29738==    by 0x492B637: reset (unique_ptr.h:203)
==29738==    by 0x492B637: reset (unique_ptr.h:501)
==29738==    by 0x492B637: Xapian::Internal::AndContext::postlist(Xapian::Internal::TermFreqs*) (queryinternal.cc:833)
==29738==    by 0x492B8AF: Xapian::Internal::QueryAndLike::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2502)
==29738==    by 0x4A30A1B: LocalSubMatch::get_postlist(PostListTree*, unsigned int*) (localsubmatch.cc:188)
==29738==    by 0x4A32A8B: Matcher::get_local_mset(unsigned int, unsigned int, unsigned int, Xapian::Weight const&, Xapian::MatchDecider const*, Xapian::KeyMaker const*, unsigned int, unsigned int, int, double, double, Xapian::Enquire::docid_order, unsigned int, Xapian::Enquire::Internal::sort_setting, bool, double, std::vector<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy>, std::allocator<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy> > > const&) (matcher.cc:381)

and that data had been allocated before at

==29738==  Block was alloc'd at
==29738==    at 0x4885828: operator new(unsigned long) (vg_replace_malloc.c:422)
==29738==    by 0x492B4B7: add_op<EstimateOp::op_type> (localsubmatch.h:101)
==29738==    by 0x492B4B7: add_op<EstimateOp::op_type> (queryoptimiser.h:84)
==29738==    by 0x492B4B7: postlist (queryinternal.cc:629)
==29738==    by 0x492B4B7: Xapian::Internal::AndContext::postlist(Xapian::Internal::TermFreqs*) (queryinternal.cc:842)
==29738==    by 0x492B8AF: Xapian::Internal::QueryAndLike::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2502)
==29738==    by 0x492873F: Xapian::Query::Internal::postlist_sub_or_like(Xapian::Internal::OrContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*, bool) const (queryinternal.cc:1174)
==29738==    by 0x492C307: Xapian::Internal::QueryBranch::do_or_like(Xapian::Internal::OrContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*, unsigned int, unsigned long, bool) const (queryinternal.cc:2256)
==29738==    by 0x492C9AF: Xapian::Internal::QueryOr::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2619)
==29738==    by 0x492A977: Xapian::Query::Internal::postlist_sub_and_like(Xapian::Internal::AndContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:1163)
==29738==    by 0x492524F: Xapian::Internal::QueryAndLike::postlist_sub_and_like(Xapian::Internal::AndContext&, Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2515)
==29738==    by 0x492B89B: Xapian::Internal::QueryAndLike::postlist(Xapian::Internal::QueryOptimiser*, double, Xapian::Internal::TermFreqs*) const (queryinternal.cc:2499)
==29738==    by 0x4A30A1B: LocalSubMatch::get_postlist(PostListTree*, unsigned int*) (localsubmatch.cc:188)
==29738==    by 0x4A32A8B: Matcher::get_local_mset(unsigned int, unsigned int, unsigned int, Xapian::Weight const&, Xapian::MatchDecider const*, Xapian::KeyMaker const*, unsigned int, unsigned int, int, double, double, Xapian::Enquire::docid_order, unsigned int, Xapian::Enquire::Internal::sort_setting, bool, double, std::vector<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy>, std::allocator<Xapian::Internal::opt_intrusive_ptr<Xapian::MatchSpy> > > const&) (matcher.cc:381)

The valgrind log suggests to me that the various rewrites of the EstimateOp stack do no propagate to all the places where pointers to those EstimateOps are held. To mitigate the invalid write, a shared_ptr might help, but I fear that would just hide the logical bug that's occurring here? I might try doing that though, to at least mitigate the crash until I could fix the actual root cause.

Since I can't share the database as-is, is there some way I can rewrite the terms in that database to obfuscate its contents but still replicate the crash so that I can share it? I tried trimming down the database to the minimal set using the WritableDatabase API, but it looks as if I can't mess with the postlists or term lists that way?

Attachments (4)

valgrind.txt (29.3 KB ) - added by Robert Stepanek 3 weeks ago.
Valgrind log
ticket838.tar (40.0 KB ) - added by Robert Stepanek 3 weeks ago.
clear-hint.patch (825 bytes ) - added by Olly Betts 3 weeks ago.
Patch to clear the hint
candidate-fix.patch (917 bytes ) - added by Olly Betts 3 weeks ago.
Candidate fix

Download all attachments as: .zip

Change History (14)

by Robert Stepanek, 3 weeks ago

Attachment: valgrind.txt added

Valgrind log

comment:1 by Olly Betts, 3 weeks ago

I wonder if it's the hint postlist that's to blame. That prolongs the lifetime of one postlist, so its destructor gets called later. The one thing that doesn't make sense for this as the explanation is that the hint should always be the postlist for a term, but the failure happens for a SelectPostList.

Could you try with the attached patch?

If that fixes it, it would be a plausible short-term workaround but would need careful study as to whether it harms performance in the cases the hint postlist mechanism is there to improve.

It'd be really good to have a reproducer we can put in the testsuite, and it seems to be only two documents so hopefully we can find a way. I suspect the exact circumstances matter, but it's likely to only be the ordering and occurrence patterns of the terms.

You can't edit postlists or termlists directly - the way to do it would be to scan the documents from the database and create a new database with new documents based on them. Like copydatabase but instead of adding the documents from the existing database to the new one, create new documents with a subset of the terms and add those instead.

The size of the database doesn't seem to be a problem though. If we had an "anonymiser" which replaced every term with an arbitrary one but preserving the collation order and with the same wdf and term positions, would you be able to share the database passed through that? I think it would be easy enough to write such a tool (and perhaps a useful one to have for future instances where a bug is reproducible but only with a database which can't be shared).

(E.g. we have 335 words over two documents - at worst they're all unique words, so if we map them to terms which are 3 digit numbers then e.g. a -> 000, aardvark -> 001, etc.)

comment:2 by Olly Betts, 3 weeks ago

Hmm, but if it was the hint then accepted and rejected should still be zero so we shouldn't dereference estimate_op:

SelectPostList::~SelectPostList()
{
    if (estimate_op && (accepted || rejected)) {
        // Only call report_ratio() if there are counts.  During the building
        // of the PostList tree we sometimes need to delete PostList objects
        // and their associated EstimateOp and it's hard to arrange that they
        // are always deleted in the correct order.
        estimate_op->report_ratio(accepted, rejected);
    }
}

The ownership tracking around this has probably become too complicated. shared_ptr probably helps build something that is easier to reliably reason about object lifetimes in, but it feels like it's not actually fixing the underlying design of this to be cleaner (essentially hint postlists and then the EstimateOp machinery got bolted on to an existing system).

Anyway, let's try to work out what's actually going on here.

comment:3 by Olly Betts, 3 weeks ago

Version: git master

Setting version - the EstimateOp mechanism isn't in 1.4.x so bugs in it shouldn't be able to manifest there.

by Robert Stepanek, 3 weeks ago

Attachment: ticket838.tar added

comment:4 by Robert Stepanek, 3 weeks ago

I could not reproduce the invalid write with obfuscated terms, but I managed to trim the database to the minimal set of terms required to make valgrind detect it.

This is all there is

$ xapian-delve-1.5 -a -1 -vv ticket838_db/
All terms in database (termfreq, collection freq, wdf upper bound):
XEP 1 0 0
XFgroups 1 1 1
XSchange 1 1 1
XSmember 1 1 1

I attached a new file "ticket838.tar" to this ticket, which includes the database as well as a C++ file which reproduces the invalid write. Running it with valgrind produces the valgrind log I had submitted initially.

I did not find any patch to disable hints, so I did not try that.

Last edited 3 weeks ago by Robert Stepanek (previous) (diff)

by Olly Betts, 3 weeks ago

Attachment: clear-hint.patch added

Patch to clear the hint

comment:5 by Olly Betts, 3 weeks ago

Milestone: 1.5.0

Oops, I failed to actually attach the patch yesterday - now there.

Thanks for the reduced testcase, I'll have a poke.

comment:6 by Olly Betts, 3 weeks ago

I do also get the valgrind errors.

The query can be simplified a little, e.g. this triggers it:

libtool --mode=execute valgrind examples/quest -d ticket838_db -s '' -p s:XS -b x:X 's:member AND (s:"member member" OR s:member) NOT x:EP'

Query object from that:

Parsed Query: Query(((XSmember@1 AND ((XSmember@2 PHRASE 2 XSmember@3) OR XSmember@4)) AND_NOT XEP))

(Using quest because it's less faff to try to minimise by edits of a command line than edit/compile/run of C++.)

So it seems the MatchAlls aren't relevant. The phrase clearly is since the bad write is from SelectPostList's destructor. I can't seem to remove anything else and still trigger it. Also XEP does need to be a different term.

comment:7 by Olly Betts, 3 weeks ago

The patch to clear the hint doesn't seem to help.

Adding some prints to the code:

destroy_postlist(XEP:1)
pop_op() popping type 6, now doing 2 subqueries:
pop_op() popping type 0
pop_op() popping type 8, now doing 2 subqueries:
pop_op() popping type 0
pop_op() popping type 4, now doing 1 subqueries:
pop_op() popping type 6, now doing 2 subqueries:
pop_op() popping type 0
pop_op() popping type 0

The op types are from:

    enum op_type {                                                                             
        KNOWN,                                                                           
        DECIDER = 1,                                                                                        
        NEAR = 2, PHRASE = 3, EXACT_PHRASE = 4,                                                                                       
        POSTING_SOURCE,                                                                                       
        AND, AND_NOT, OR, XOR                                                                                 
    };                                                                                            

So it does pop_op of an AND with 2 subqueries, which is clearly not the right thing at all. I'm amazed this doesn't go wrong more, but the testsuite is run under valgrind in CI and passes.

I think the trigger for this is that it can probably see that the right side of the AND_NOT can be discarded because the left and right sides can't match at the same time (that can be spotted from the term range info but I haven't followed through that code yet to check that's actually how it gets to there).

This is probably a reasonable short-term workaround:

diff --git a/xapian-core/matcher/queryoptimiser.h b/xapian-core/matcher/queryoptimiser.h
index 5ca302a00a0c..8f2bf591fe04 100644
--- a/xapian-core/matcher/queryoptimiser.h
+++ b/xapian-core/matcher/queryoptimiser.h
@@ -173,7 +173,7 @@ class QueryOptimiser {
 	}
 	// Remove the EstimateOp (and any sub-ops) which we generated for this
 	// PostList.
-	localsubmatch.pop_op();
+	//localsubmatch.pop_op();
     }
 
     bool need_wdf_for_compound_weight() const {

We should resolve all the EstimateOps anyway, so having extra ones should be OK. Maybe that's actually the right fix, but I definitely want to look at this in more detail.

comment:8 by Olly Betts, 3 weeks ago

I think the trigger for this is that it can probably see that the right side of the AND_NOT can be discarded because the left and right sides can't match at the same time (that can be spotted from the term range info but I haven't followed through that code yet to check that's actually how it gets to there).

That seems right - we've created postlists (and EstimateOps) for both sides of the AND_NOT and before we create the AndNotPostList (or its EstimateOp) we can see the two sides are disjoint so we can discard the right side. We try to pop_op() to remove the unwanted right hand side EstimateOp, but the order of the ops in the estimate_stack isn't right for doing that.

We should resolve all the EstimateOps anyway, so having extra ones should be OK. Maybe that's actually the right fix, but I definitely want to look at this in more detail.

If we don't pop_op() then we need to create an AND_NOT EstimateOp for the estimates to get resolved correctly.

The parsed query and stack (with the missing AND_NOT added) would look like this (where [first, last] is the known possible range of documents matched):

Parsed Query: Query(((XSmember@1 AND ((XSmember@2 PHRASE 2 XSmember@3) OR XSmember@4)) AND_NOT XEP))

AND_NOT x2 [1, 1]
  AND x2 [1, 1]
    KNOWN min=1 est=1 max=1 [2, 2]
    OR x2
      KNOWN min=1 est=1 max=1 [1, 1]
      EXACT_PHRASE
	AND x2 [1, 1]
	  KNOWN min=1 est=1 max=1 [1, 1]
	  KNOWN min=1 est=1 max=1 [1, 1]
  KNOWN min=1 est=1 max=1 [1, 1]

KNOWN min=1 est=1 max=1 [2, 2] must be XEP so somehow the order is wrong - that and the AND should be the other way around which would give:

AND_NOT x2 [1, 1]
  KNOWN min=1 est=1 max=1 [2, 2]
  AND x2 [1, 1]
    OR x2
      KNOWN min=1 est=1 max=1 [1, 1]
      EXACT_PHRASE
	AND x2 [1, 1]
	  KNOWN min=1 est=1 max=1 [1, 1]
	  KNOWN min=1 est=1 max=1 [1, 1]
    KNOWN min=1 est=1 max=1 [1, 1]

And if that were the case, the existing code which calls pop_op() would indeed remove the now unwanted EstimateOp from the right side of the AND_NOT.

Last edited 3 weeks ago by Olly Betts (previous) (diff)

by Olly Betts, 3 weeks ago

Attachment: candidate-fix.patch added

Candidate fix

comment:9 by Olly Betts, 3 weeks ago

I've found the cause of that problem - we need to create the AND_NOT ExpandOp before the AND one. With candidate-fix.patch applied, valgrind reports no errors.

I'm not entirely happy with the design here though - things need to be created in the right order and while that order mostly matches the natural order that things happen in, this case shows it's possible to get the order wrong without that standing out as wrong in the code.

I think given the underlying problem wasn't the object lifetimes (but rather that we were deleting the wrong objects) that in the short term I'll apply that fix (assuming it works for your full case), turn your reproducer into a regression test, and create testcases which fully exercise other places where the EstimateOp handling is complex. The testsuite gets run under valgrind in CI (and by me locally) so that should catch any other similar problems.

I think the case that is wrong looks like the most tricky, because AndContext pulls together all "connected" and-like operations and groups the associated "nots" and "maybes", so in the general case it needs to build a complex tree of different postlist types and so handle the corresponding EstimateOps too.

comment:10 by Robert Stepanek, 2 weeks ago

Thanks for this. I don't have anything to add to this analysis. I applied the candidate fix and did a bunch of local tests, all of which succeeded without invalid memory writes. I will now apply this patch to a broader set of real-world queries in a test environment.

Note: See TracTickets for help on using tickets.