Opened 13 years ago

Closed 11 years ago

#216 closed defect (fixed)

Inconsistent return values for percentage weights

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.0.11
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

When results are being sorted primarily by an order other than relevance (e.g. sort_by_value()), the percentage values returned by the MSet object may be incorrect because they are calculated based on the document in the portion of the MSet requested which has the highest weight, instead of the document matching the query which has the highest weight.

This issue has existed in all previous Xapian releases, as far as we can tell.

There is currently no fix in progress, since it is probably not possible to fix without significant loss of efficiency, which would adversely affect users who aren't interested in the percentage scores.

If you really need percentage scores in this situation, one workaround would be to first run the search using relevance order, asking for only the top document, and to remember the weight and percentage assigned to that document. Then, re-run the search in sorted order, and calculate the percentages yourself from the weights assigned to the results, using this information.

A testcase demonstrating this is attached to this ticket.

The issue is that in multimatch.cc, we calculate "best" by looking for the highest weighted document in the candidate mset, but when sorting by anything other than relevance, the highest weighted document may have been discarded already.

It is hard to see how to fix this - one obvious approach would be to check every candidate document's weight before discarding it during the match process, and keep track the docid of the document with the highest weight seen so far. However, we currently don't calculate the weight for all the documents we see (because we first check the document against the lowest document in the mset using mcmp), so this would force us to calculate the weights on documents we wouldn't otherwise need to calculate it for. Since the percentages aren't necessarily even wanted, this seems a shame.

Perhaps a reasonable approach would be to add a flag on enquire which governed whether percentages were wanted or not; it would then be more reasonable to go to extra effort to keep track of the highest weighted document if the percentages were actually desired.

Attachments (4)

calcpercent.patch (25.7 KB ) - added by Richard Boulton 12 years ago.
Patch which adds a method to allow percent calculation to be disabled.
bestpercentbug.patch (4.2 KB ) - added by Richard Boulton 12 years ago.
Updated testcase
bug216-testcase.patch (2.4 KB ) - added by Olly Betts 11 years ago.
Updated testcase patch for current SVN trunk
bug216-fix.patch (1.9 KB ) - added by Olly Betts 11 years ago.
Patch to fix this issue

Download all attachments as: .zip

Change History (24)

comment:1 by Olly Betts, 13 years ago

Blocked By: 200 added
Status: newassigned

Having a flag certainly sounds reasonable, and would allow us to save ever looking at the termlist after the match.

We don't currently have a "flags" mechanism for Enquire - not sure if this would be best as a "set_flags()" or "set_want_percentages()" or what.

Perhaps in the short term we should see what the overhead is of just always calculating the weight when we need it to get this right?

Marking as blocking 1.0.5 - I don't think we have to fix it for then, but we should at least make a concious decision not to.

comment:2 by Olly Betts, 12 years ago

Blocked By: 200 removed
Blocking: 200 added
Operating System: All

Marking as blocking bug#200, not as blocked by it.

comment:4 by Richard Boulton, 12 years ago

Description: modified (diff)
Milestone: 1.1

comment:5 by Richard Boulton, 12 years ago

Milestone: 1.1.01.0.7

comment:6 by Richard Boulton, 12 years ago

Blocking: 200 removed

comment:7 by Olly Betts, 12 years ago

Owner: changed from New Bugs to Richard Boulton
Status: assignednew

by Richard Boulton, 12 years ago

Attachment: calcpercent.patch added

Patch which adds a method to allow percent calculation to be disabled.

comment:8 by Richard Boulton, 12 years ago

I don't think this bug can be fixed with the current API without compromising efficiency quite significantly. Therefore, it requires calcpercent.patch to be applied to allow searches to be performed without the percentage calculation code.

However, calcpercent.patch defaults to not calculating percentages. If this was applied to 1.0.7, it would break existing applications which require percentages. Therefore, we would need to make the default be calculation of percentages. If we then applied a fix for the bug, any existing application which didn't turn off percentage calculation would suffer a performance drop.

Since this bug seems to be fairly benign (I found it when reading the code, rather than having a problem in a real application), I suggest simply bumping this issue to 1.1.x, and documenting that percentage results in 1.0.x are unreliable when not sorting by relevance.

comment:9 by Richard Boulton, 12 years ago

Status: newassigned

comment:10 by Olly Betts, 12 years ago

Milestone: 1.0.71.0.8
Owner: changed from Richard Boulton to Olly Betts
Status: assignednew

I think bumping sounds reasonable, but I'd like to take a look myself first, so marking for 1.0.8 for now and claiming this to remind myself I need to look. Also, if we do bump, the issue should be documented as you say.

comment:11 by Olly Betts, 12 years ago

Description: modified (diff)
Status: newassigned

Merge the ReleaseNotes entry from 1.0.7 into the description to try to keep information about this issue in one place.

comment:12 by Olly Betts, 12 years ago

Milestone: 1.0.81.0.9

Still not had a chance to look, bumping to 1.0.9.

by Richard Boulton, 12 years ago

Attachment: bestpercentbug.patch added

Updated testcase

comment:13 by Olly Betts, 12 years ago

Milestone: 1.0.91.0.10

Bumping to 1.0.10...

by Olly Betts, 11 years ago

Attachment: bug216-testcase.patch added

Updated testcase patch for current SVN trunk

by Olly Betts, 11 years ago

Attachment: bug216-fix.patch added

Patch to fix this issue

comment:14 by Olly Betts, 11 years ago

Milestone: 1.0.101.0.11

OK, bug216-fix.patch fixes this issue. However:

  • It means we calculate the weight in potentially many more cases, which could be slow so we probably also want an API method to say that percentages are actually wanted as mentioned above. However, that's not really suitable for 1.0.x - we'd have to default to calculating percentages so as not to break compatibility with existing code, and anyone not updating their code to call the new method would get penalised.
  • There's an issue where we look at the weight of documents which a MatchDecider might reject if offered it. This means that to be pedantically correct, we need to pass more documents to any MatchDecider when sorting (partly) by value, which could incur quite an overhead. Currently my patch ignore this with the result that the percentage weights returned might be lower than they should be in this case.

So the big question of whether we fix this for 1.0.x still remains (the patch is certainly simple enough to consider). We do at least now have a patch to measure the overhead of if we think it might be worth backporting.

But I guess we'll probably have to pass on this for 1.0.10 if I want to release it this weekend though, so bumping the milestone for now at least.

comment:15 by Richard Boulton, 11 years ago

bug216-fix.patch certainly looks reasonable to me. I think it would be reasonable to just document the problem with MatchDeciders, for now.

My feeling is that it's not really worth worrying about this for 1.0.x, anyway, so we should just bump it to 1.1.0, and add the method to enable percentage calculation at the same time, with a default of not calculating percentages.

Some profiling of how long the extra weight calculation takes would be interesting, but I don't think this issue is important enough to hold up work on 1.0.x (or, indeed, the work trying to get 1.1.0 out).

comment:16 by Olly Betts, 11 years ago

I think we now have a real world sighting of this bug:

http://thread.gmane.org/gmane.comp.search.xapian.general/6956

comment:17 by Richard Boulton, 11 years ago

Just to note that since revision [11822] (on trunk) we now throw an UnimplementedError if we're asked for a percentage cutoff and to sort primarily by value.

I'm tending towards the feeling that percentages, calculated in the way we do, are more trouble (in terms of code complexity, and developer time) than they're worth. We could change the calculation of percentages to be based on the maxweight value (and, with improvements in the statistics held, we should be able to start getting tighter bounds on maxweight), and remove a lot of special-case code in the matcher which handles changes in percentage cutoff weights.

We could also provide an interface which returns the term weights for each of the terms in a query (generally useful).

If users require a "precise" percentage calculated in the current way, they could get hold of the weight of the top document (either by asking for it to be included in the mset, if they're doing a relevance-sorted search, or by performing a separate search specifically for it), calculate the normalised percentage for it using the term weights (and get_matching_terms()), and then perform a search and calculate the percentages from that search. Percentage cutoffs could also be done using weights in a similar manner.

We could provide some helper classes/code to help users to implement this sort of scheme, but my feeling is that pulling it out of the matcher would be a big win.

I don't think it would be unreasonable to experiment with this approach in the 1.1 release series.

comment:18 by Olly Betts, 11 years ago

Even with better bounds on the weight, we're rarely going to get the attained maximum equalling the bound. And users are (quite reasonably) going to be surprised if 100% is not attained in most cases where a document matches all the terms in the query. I don't think that approach is really workable.

There isn't actually a lot of special case code for changes to percentage cutoff weights - that's all handled in a single if statement. But it does feature quite a lot in the post-match adjusting of bounds, etc.

Several people have seemed confused that two matches with the same percentage score don't count as having the same "relevance" so I think if we're going to give up on the current approach to percentages, we should just leave it to the user to calculate them if they want them. Ironically this is how it used to be - the percentage calculating code was originally part of Omega.

We could also provide an interface which returns the term weights for each of the terms in a query (generally useful).

Already exists - MSet::get_termweight(tname)!

they could get hold of the weight of the top document (either by asking for it to be included in the mset, if they're doing a relevance-sorted search [...])

MSet::get_max_attained() should report that weight even if first > 0.

Anyway, for the moment, the question is really if we should apply this patch for 1.0.11. I'll try running some tests to see if the patch slows things down much.

comment:19 by Olly Betts, 11 years ago

The patch makes very little speed difference in my tests (<0.1%) so I think it's worth applying.

So, fixed in trunk r12059. I just documented the remaining issue when using a matchdecider too.

comment:20 by Olly Betts, 11 years ago

Oops, we want the regression test too - applied that in r12060.

comment:21 by Olly Betts, 11 years ago

Resolution: fixed
Status: assignedclosed

Backported for 1.0.11 in r12061.

Note: See TracTickets for help on using tickets.