Opened 9 years ago

Closed 6 years ago

#682 closed defect (fixed)

Collapsing when sorting by value unnecessarily slow

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: Matcher Version: 1.2.20
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Gareth Rees reported on IRC that collapsing when sorting by value could be very slow, and this got worse as offset was increased (approximately linear in offset from a graph he showed).

Background

Once the matcher has found enough matching results (where enough is offset + msize), it keeps the best offset + msize in a heap, with the worst one at the tip of the heap (sometimes called a "min heap").

Each new matching document is compared to the tip of that heap - if it's worse, it's discarded; if it's better, the current tip is popped and the new document added to the heap.

When collapsing, to replace a document with one with the same collapse key but which ranks more highly, we currently do a linear scan of the vector the heap is over to find the old document, and replace it in the vector with the better one (which means the vector is no longer a valid heap), and flag the heap as needing a rebuild before we next use it as a heap.

So if collapsing replaces a lot documents, the heap gets rebuilt a lot.

If offset or msize is set high, the heap is large (since it has msize + offset entries).

Sorting by value makes this worse - if you sort by relevance the match can often terminate early, which it can't for sorting by value. Also sorting by value probably makes the comparisons while rebuilding the heap more expensive.

Solution

There's a FIXME comment in matcher/multimatch.cc which notes this (in less detail):

                            // We can replace an arbitrary element in O(log N)
                            // but have to do it by hand (in this case the new
                            // elt is bigger, so we just swap down the tree).
                            // FIXME: implement this, and clean up is_heap
                            // handling
                            *i = new_item;
                            pushback = false;
                            is_heap = false;

As the comment notes, the replacement is possible in O(log N) operations, but this is not an operation which the heap in the C++ STL provides.

But if there's a suitably licensed heap template (boost may have something, or maybe one of the open source C++ STLs is suitably licensed) we could copy that in and add the operation we want to it.

Marking as found in 1.2.20, but the code has been this way for a long time. I don't think anyone has reported this as causing performance issues in practice before now.

Setting milestone 1.3.x initially.

Change History (5)

comment:1 by Olly Betts, 8 years ago

Milestone: 1.3.x1.4.x

Doesn't change the ABI, so can be done in 1.4.x.

comment:2 by Olly Betts, 6 years ago

Milestone: 1.4.x1.5.0
Status: newassigned

I've nearly finished reimplementing the matcher, and the new version uses a custom heap implementation which provides the needed operation mentioned above, and collapsing will use it.

This work should land in 1.5.0.

comment:3 by Olly Betts, 6 years ago

Link to logs from original report:

https://botbot.me/freenode/xapian/2015-06-09/?msg=41419627&page=1

The script used is still in the link given, trying to contact Gareth to find out if the suitable data is available - it'd be good to check what difference the changes make.

comment:4 by Olly Betts, 6 years ago

The matcher reimplementation has landed in master in [dfc8f93283d19a5a150418586de896df346b8301].

It's not really feasible to backport that change to 1.4.x as it builds on a series of other changes which are only in master, but if that change is confirmed as addressing this issue we could probably take the custom heap implementation in matcher/heap.h and make use of it in 1.4.x.

comment:5 by Olly Betts, 6 years ago

Resolution: fixed
Status: assignedclosed

That custom implementation is now in common/heap.h.

I did contact Gareth successfully shortly after the last comment and he was going to try to test with git master, but he's not got back to us since.

I had a quick look at the code on the RELEASE/1.4 branch. Two things happen there each time we find a better candidate for a collapse key:

  • We linearly scan through the array the heap is in until we find the previous candidate (or don't find it if it has been eliminated from the proto-MSet already). On git master there's another layer of indirection here which allows us to do this in O(1).
  • We flag the heap and in need of rebuilding next time we need to use it as a heap, and a call to make_heap needs most 3*N comparisons, and worst case is that we always have to rebuild. With the custom heap implementation we can use Heap::siftdown() to replace the element while keeping the heap valid in at most 2*log(N) comparisons.

Using the custom heap implementation looks fairly straightforward (I refitted it throughout git master recently and that wasn't hard) but that still leaves us with a worst case of O(m*n) where m is the requested MSet size and n the number of documents in the database.

I think trying to retrofit the additional layer of indirection into the RELEASE/1.4 branch would be too disruptive, especially as we're a long way into the 1.4 stable release series.

Changing the heap implementation would probably help, but it carries some risk for all users while fixing a problem which only users using collapsing, and not all those users even. And it is unlikely to deliver the substantial change which git master should.

So I think we don't try to backport anything for this issue.

Note: See TracTickets for help on using tickets.