Opened 16 years ago

Closed 16 years ago

Last modified 13 years ago

#363 closed defect (fixed)

Avoid using the termlist when calculating percentages in get_mset()

Reported by: Richard Boulton Owned by: Olly Betts
Priority: high Milestone: 1.1.2
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: #181 Operating System: All

Description

Currently, get_mset() uses the termlist for assigning percentages; it iterates through the termlist looking up the weight for each term in it and in the query, and adding them together to get a value for the weight for 100%.

This requires an access to the termlist (which can be quite costly) and also doesn't work correctly for synonyms, since these aren't in the termlist.

This could be avoided by keeping track of which Xapian::Weight::Internal objects were used for the best msetitem, or just of the sum of the termweights for these.

This information would need to be serialised along with the MSet.

Attachments (1)

xapian-percent-scaling-without-termlists.patch (23.4 KB ) - added by Olly Betts 16 years ago.
Patch which implements my idea

Download all attachments as: .zip

Change History (11)

comment:1 by Richard Boulton, 16 years ago

Status: newassigned

Note - trunk is currently failing when assertions are on in the synonym4 test in apitest due to this bug.

comment:2 by Richard Boulton, 16 years ago

One method for fixing this:

  • Keep a vector of all the weight objects generated when building the query.
  • Tell each leaf postlist (or weight object) the index of its weight object in this list.
  • Introduce a vector<bool> which is used to keep track of which weight objects have been used.
  • Clear this before calling get_weight() on the top postlist from get_mset().
  • Whenever a leaf postlist's get_weight() method is called, set the appropriate bit.
  • If the result is a new top-doc in get_mset(), copy its vector<bool> to a safe place.
  • Use the vectors to calculate the weight.

Risk: clearing and setting the bits in the vector for every match might be overly CPU intensive.

comment:3 by Olly Betts, 16 years ago

Feel free to try it, but I think that's going to be measurably slower.

On the plus side, note that this can potentially be optimised for AND queries with all leaf children - e.g. MultiAndPostList of terms knows for certain that all its sub-postlists will match, so no need to recurse them - we can store a mask to | in.

I think we would probably want a specialised class rather than vector<bool> as we don't need dynamic resizing (just a runtime-specified fixed size, which rules out bitset<> which has a compile-time constant fixed size) but we do want the ability to efficiently do: v1 |= v2;

But vector<bool> should be fine to prototype what sort of overhead there is here.

Perhaps better, be lazy about this and only do it for a match which is the current best via a new method. This has a bad worst case though - if the weights increase with docid it's even more work than the non-lazy approach as there are a lot of extra virtual method calls.

This could be optional on a "want percentages" flag, but it would be nice to use this for "matching terms" too. Which would require calculating and keeping these for anything in the proto-MSet...

I don't think we want to rush into this, but we clearly can't ship 1.1.1 with the testsuite failing assertions. So I think it's better to find a quick fix for the synonym4 assertion failure, perhaps along the lines of the (non-working) patch I tried.

comment:4 by Richard Boulton, 16 years ago

I fixed the assertion failure in r12623; essentially by forcing the percentage of the top document to be 100% if no non-synonym terms match it. However, this isn't a very desirable result, so some alternative way to fix it would be good.

comment:5 by Olly Betts, 16 years ago

Milestone: 1.1.11.1.3

comment:6 by Olly Betts, 16 years ago

Owner: changed from Richard Boulton to Olly Betts
Status: assignednew

I've had an idea...

We could just count the number of matching leaves and divide by the (original) total to get the factor to scale down the top match by. This isn't quite as good as the current approach, which scales down by more if a higher scoring term is missing, but it's simpler, and percentages are rather arbitrary anyway. We also suspected they are less popular than they were.

Having a look at this idea, so reassigning the bug, at least for now.

by Olly Betts, 16 years ago

Patch which implements my idea

comment:7 by Olly Betts, 16 years ago

Milestone: 1.1.31.1.2
Status: newassigned

OK, I have a patch, and it passes the testsuite with changes to two tests:

  • topercent2 needs a tweak because we generate different percentages
  • synonym4 needs a tweak to disable a check that we attain 100% for OP_XOR of a term and a synonym of two terms.

The former seems fine to me.

As for the latter, unless I'm missing something, it's actually incorrect that we give 100% currently. Or at least inconsistent - perhaps a XOR b should give 100% if the XOR matches, but it doesn't currently.

Although the patch currently passes all tests, I think the remote case may need further work. I don't think the testsuite cover multi of local and remote (we could add support for this as another multi variant, though the combinatorial explosion means we need to be careful about how exactly we do this...) Comments are still welcome.

Setting milestone to 1.1.2 since we have a patch of sorts - if we omit this from 1.1.2 it should be a concious choice.

comment:8 by Olly Betts, 16 years ago

Priority: normalhigh

comment:9 by Olly Betts, 16 years ago

Patch committed with one change - I added a check that XOR with a SYNONYM subquery does not achieve 100%.

Leaving this open for now as there may be a few remaining issues to deal with.

comment:10 by Olly Betts, 16 years ago

Resolution: fixed
Status: assignedclosed

OK, I think I've now dealt with the remaining issues.

Note: See TracTickets for help on using tickets.