Opened 12 years ago

Closed 11 years ago

#216 closed defect (fixed)

Inconsistent return values for percentage weights

Reported by: richard
Owned by: olly
Priority: normal
Milestone: 1.0.11
Component: Matcher
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by olly)

When results are sorted primarily by an order other than relevance (e.g. sort_by_value()), the percentage values returned by the MSet object may be incorrect, because they are calculated relative to the highest-weighted document in the requested portion of the MSet rather than the highest-weighted document matching the query.

This issue has existed in all previous Xapian releases, as far as we can tell.

There is currently no fix in progress, since it is probably not possible to fix without significant loss of efficiency, which would adversely affect users who aren't interested in the percentage scores.

If you really need percentage scores in this situation, one workaround is to first run the search in relevance order, asking for only the top document, and record the weight and percentage assigned to that document. Then re-run the search in sorted order and calculate the percentages yourself, scaling the weights of the results by the recorded weight and percentage.
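A minimal sketch of the rescaling step in that workaround (the helper name rescale_percent is invented for illustration; the Xapian calls described in the comments assume the standard Enquire/MSet API):

```cpp
#include <cassert>

// Rescale a raw document weight to a percentage, given the weight and
// percentage of the top relevance-ranked document (obtained beforehand
// from a relevance-ordered search asking for only the first result).
// This helper is illustrative, not part of the Xapian API.
inline int rescale_percent(double weight, double top_weight, int top_percent) {
    if (top_weight <= 0.0) return 0;
    double pct = (weight / top_weight) * top_percent;
    // Clamp to the conventional 0..100 range before rounding.
    if (pct > 100.0) pct = 100.0;
    if (pct < 0.0) pct = 0.0;
    return static_cast<int>(pct + 0.5);
}
```

In outline: run the query in the default relevance order with get_mset(0, 1), record the top hit's get_weight() and get_percent(), then switch to set_sort_by_value() and apply the helper to each result's get_weight().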

A testcase demonstrating this is attached to this ticket.

The issue is that in multimatch.cc, we calculate "best" by looking for the highest weighted document in the candidate mset, but when sorting by anything other than relevance, the highest weighted document may have been discarded already.

It is hard to see how to fix this. One obvious approach would be to check every candidate document's weight before discarding it during the match process, and keep track of the docid of the document with the highest weight seen so far. However, we currently don't calculate the weight for every document we see (because we first check the document against the lowest document in the MSet using mcmp), so this would force us to calculate weights for documents we wouldn't otherwise need them for. Since the percentages aren't necessarily even wanted, that seems a shame.
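A toy simulation of the problem described above (plain structs standing in for the matcher's state; this is not the real multimatch.cc code): because documents are retained by sort key rather than weight, the globally highest-weighted document can already be gone when "best" is computed from the survivors.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative stand-in for a matched document: a sort key (e.g. a
// document value) and a relevance weight.
struct Doc { int sort_key; double weight; };

// Highest weight among the k documents that win on sort_key - this is
// what "best" effectively sees when sorting by value.
inline double best_in_kept(std::vector<Doc> docs, std::size_t k) {
    std::sort(docs.begin(), docs.end(),
              [](const Doc& a, const Doc& b) { return a.sort_key > b.sort_key; });
    if (docs.size() > k) docs.resize(k);
    double best = 0.0;
    for (const Doc& d : docs) best = std::max(best, d.weight);
    return best;
}

// Highest weight over every matching document - what the percentage
// calculation should really be based on.
inline double best_overall(const std::vector<Doc>& docs) {
    double best = 0.0;
    for (const Doc& d : docs) best = std::max(best, d.weight);
    return best;
}
```

With documents {sort_key, weight} = {3, 1.0}, {2, 2.0}, {1, 9.0} and k = 2, the kept set's best weight is 2.0 while the true maximum is 9.0, so percentages computed from the kept set are inflated.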

Perhaps a reasonable approach would be to add a flag on Enquire governing whether percentages are wanted; it would then be more reasonable to go to the extra effort of tracking the highest-weighted document only when percentages are actually desired.

Attachments (4)

calcpercent.patch (25.7 KB) - added by richard 11 years ago.
Patch which adds a method to allow percent calculation to be disabled.
bestpercentbug.patch (4.2 KB) - added by richard 11 years ago.
Updated testcase
bug216-testcase.patch (2.4 KB) - added by olly 11 years ago.
Updated testcase patch for current SVN trunk
bug216-fix.patch (1.9 KB) - added by olly 11 years ago.
Patch to fix this issue


Change History (24)

comment:1 Changed 12 years ago by olly

  • Blocked By 200 added
  • Status changed from new to assigned

Having a flag certainly sounds reasonable, and would allow us to save ever looking at the termlist after the match.

We don't currently have a "flags" mechanism for Enquire - I'm not sure whether this would be best as a "set_flags()", a "set_want_percentages()", or something else.

Perhaps in the short term we should see what the overhead is of just always calculating the weight when we need it to get this right?

Marking as blocking 1.0.5 - I don't think we have to fix it for then, but we should at least make a conscious decision not to.

comment:2 Changed 12 years ago by olly

  • Blocked By 200 removed
  • Blocking 200 added
  • Operating System set to All

Marking as blocking bug#200, not as blocked by it.

comment:4 Changed 12 years ago by richard

  • Description modified (diff)
  • Milestone set to 1.1

comment:5 Changed 12 years ago by richard

  • Milestone changed from 1.1.0 to 1.0.7

comment:6 Changed 12 years ago by richard

  • Blocking 200 removed

comment:7 Changed 11 years ago by olly

  • Owner changed from newbugs to richard
  • Status changed from assigned to new

Changed 11 years ago by richard

Patch which adds a method to allow percent calculation to be disabled.

comment:8 Changed 11 years ago by richard

I don't think this bug can be fixed with the current API without compromising efficiency quite significantly. Fixing it therefore requires applying calcpercent.patch, which allows searches to be performed without the percentage calculation code.

However, calcpercent.patch defaults to not calculating percentages. If this was applied to 1.0.7, it would break existing applications which require percentages. Therefore, we would need to make the default be calculation of percentages. If we then applied a fix for the bug, any existing application which didn't turn off percentage calculation would suffer a performance drop.

Since this bug seems to be fairly benign (I found it when reading the code, rather than having a problem in a real application), I suggest simply bumping this issue to 1.1.x, and documenting that percentage results in 1.0.x are unreliable when not sorting by relevance.

comment:9 Changed 11 years ago by richard

  • Status changed from new to assigned

comment:10 Changed 11 years ago by olly

  • Milestone changed from 1.0.7 to 1.0.8
  • Owner changed from richard to olly
  • Status changed from assigned to new

I think bumping sounds reasonable, but I'd like to take a look myself first, so marking for 1.0.8 for now and claiming this to remind myself I need to look. Also, if we do bump, the issue should be documented as you say.

comment:11 Changed 11 years ago by olly

  • Description modified (diff)
  • Status changed from new to assigned

Merged the ReleaseNotes entry from 1.0.7 into the description, to try to keep information about this issue in one place.

comment:12 Changed 11 years ago by olly

  • Milestone changed from 1.0.8 to 1.0.9

Still not had a chance to look, bumping to 1.0.9.

Changed 11 years ago by richard

Updated testcase

comment:13 Changed 11 years ago by olly

  • Milestone changed from 1.0.9 to 1.0.10

Bumping to 1.0.10...

Changed 11 years ago by olly

Updated testcase patch for current SVN trunk

Changed 11 years ago by olly

Patch to fix this issue

comment:14 Changed 11 years ago by olly

  • Milestone changed from 1.0.10 to 1.0.11

OK, bug216-fix.patch fixes this issue. However:

  • It means we calculate the weight in potentially many more cases, which could be slow, so we probably also want an API method to say that percentages are actually wanted, as mentioned above. However, that's not really suitable for 1.0.x - we'd have to default to calculating percentages so as not to break compatibility with existing code, and anyone not updating their code to call the new method would be penalised.
  • There's an issue where we look at the weight of documents which a MatchDecider might reject if offered them. This means that to be pedantically correct, we need to pass more documents to any MatchDecider when sorting (partly) by value, which could incur quite an overhead. Currently my patch ignores this, with the result that the percentage weights returned might be lower than they should be in this case.

So the big question of whether we fix this for 1.0.x still remains (the patch is certainly simple enough to consider). At least we now have a patch whose overhead we can measure if we think it might be worth backporting.

I guess we'll have to pass on this for 1.0.10 if I want to release it this weekend, so bumping the milestone for now at least.

comment:15 Changed 11 years ago by richard

bug216-fix.patch certainly looks reasonable to me. For now, I think it would be fine to just document the problem with MatchDecider.

My feeling is that it's not really worth worrying about this for 1.0.x, anyway, so we should just bump it to 1.1.0, and add the method to enable percentage calculation at the same time, with a default of not calculating percentages.

Some profiling of how long the extra weight calculation takes would be interesting, but I don't think this issue is important enough to hold up work on 1.0.x (or, indeed, the work trying to get 1.1.0 out).

comment:16 Changed 11 years ago by olly

I think we now have a real world sighting of this bug:

http://thread.gmane.org/gmane.comp.search.xapian.general/6956

comment:17 Changed 11 years ago by richard

Just to note that since revision [11822] (on trunk) we now throw an UnimplementedError if we're asked for a percentage cutoff and to sort primarily by value.

I'm tending towards the feeling that percentages, calculated in the way we do, are more trouble (in terms of code complexity, and developer time) than they're worth. We could change the calculation of percentages to be based on the maxweight value (and, with improvements in the statistics held, we should be able to start getting tighter bounds on maxweight), and remove a lot of special-case code in the matcher which handles changes in percentage cutoff weights.

We could also provide an interface which returns the term weights for each of the terms in a query (generally useful).

If users require a "precise" percentage calculated in the current way, they could get hold of the weight of the top document (either by asking for it to be included in the MSet, if they're doing a relevance-sorted search, or by performing a separate search specifically for it), calculate the normalised percentage for it using the term weights (and get_matching_terms()), then perform the real search and calculate percentages from its weights. Percentage cutoffs could be handled similarly using weights.
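A sketch of the normalisation step in that scheme, under the assumption that the percentage factor is derived from the top document's matching term weights (the function name percent_factor is invented; real code would obtain the term weights via MSet::get_termweight() and the matching terms via Enquire::get_matching_terms_begin()):

```cpp
#include <cassert>

// Compute a factor such that percent(doc) = weight(doc) * factor,
// normalised so that a top document matching every query term scores
// 100%. Illustrative only - not how the matcher does it internally.
inline double percent_factor(double top_weight,
                             double matching_termweight_sum,
                             double total_termweight_sum) {
    if (top_weight <= 0.0 || total_termweight_sum <= 0.0) return 0.0;
    return 100.0 * (matching_termweight_sum / total_termweight_sum) / top_weight;
}
```

For example, if the top document has weight 10.0 but matches terms accounting for only half the total term weight, the factor is 5.0, so a document with weight 10.0 scores 50% rather than 100%.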

We could provide some helper classes/code to help users to implement this sort of scheme, but my feeling is that pulling it out of the matcher would be a big win.

I don't think it would be unreasonable to experiment with this approach in the 1.1 release series.

comment:18 Changed 11 years ago by olly

Even with better bounds on the weight, we're rarely going to get the attained maximum equalling the bound. And users are (quite reasonably) going to be surprised if 100% is not attained in most cases where a document matches all the terms in the query. I don't think that approach is really workable.

There isn't actually a lot of special case code for changes to percentage cutoff weights - that's all handled in a single if statement. But it does feature quite a lot in the post-match adjusting of bounds, etc.

Several people have seemed confused that two matches with the same percentage score don't count as having the same "relevance" so I think if we're going to give up on the current approach to percentages, we should just leave it to the user to calculate them if they want them. Ironically this is how it used to be - the percentage calculating code was originally part of Omega.

> We could also provide an interface which returns the term weights for each of the terms in a query (generally useful).

Already exists - MSet::get_termweight(tname)!

> they could get hold of the weight of the top document (either by asking for it to be included in the mset, if they're doing a relevance-sorted search [...])

MSet::get_max_attained() should report that weight even if first > 0.

Anyway, for the moment, the question is really if we should apply this patch for 1.0.11. I'll try running some tests to see if the patch slows things down much.

comment:19 Changed 11 years ago by olly

The patch makes very little speed difference in my tests (<0.1%) so I think it's worth applying.

So, fixed in trunk r12059. I also documented the remaining issue when using a MatchDecider.

comment:20 Changed 11 years ago by olly

Oops, we want the regression test too - applied that in r12060.

comment:21 Changed 11 years ago by olly

  • Resolution set to fixed
  • Status changed from assigned to closed

Backported for 1.0.11 in r12061.
