Ticket #216 (assigned defect)

Opened 14 months ago

Last modified 3 weeks ago

Inconsistent return values for percentage weights

Reported by: richard Owned by: olly
Priority: normal Milestone: 1.0.11
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Operating System: All Blocking:

Description (last modified by olly) (diff)

When results are being sorted primarily by an order other than relevance (e.g. sort_by_value()), the percentage values returned by the MSet object may be incorrect because they are calculated based on the document in the portion of the MSet requested which has the highest weight, instead of the document matching the query which has the highest weight.

This issue has existed in all previous Xapian releases, as far as we can tell.

There is currently no fix in progress, since it is probably not possible to fix without significant loss of efficiency, which would adversely affect users who aren't interested in the percentage scores.

If you really need percentage scores in this situation, one workaround would be to first run the search using relevance order, asking for only the top document, and to remember the weight and percentage assigned to that document. Then, re-run the search in sorted order, and calculate the percentages yourself from the weights assigned to the results, using this information.

A testcase demonstrating this is attached to this ticket.

The issue is that in multimatch.cc, we calculate "best" by looking for the highest weighted document in the candidate mset, but when sorting by anything other than relevance, the highest weighted document may have been discarded already.

It is hard to see how to fix this - one obvious approach would be to check every candidate document's weight before discarding it during the match process, and keep track the docid of the document with the highest weight seen so far. However, we currently don't calculate the weight for all the documents we see (because we first check the document against the lowest document in the mset using mcmp), so this would force us to calculate the weights on documents we wouldn't otherwise need to calculate it for. Since the percentages aren't necessarily even wanted, this seems a shame.

Perhaps a reasonable approach would be to add a flag on enquire which governed whether percentages were wanted or not; it would then be more reasonable to go to extra effort to keep track of the highest weighted document if the percentages were actually desired.

Attachments

calcpercent.patch (25.7 kB) - added by richard 9 months ago.
Patch which adds a method to allow percent calculation to be disabled.
bestpercentbug.patch (4.2 kB) - added by richard 4 months ago.
Updated testcase
bug216-testcase.patch (2.4 kB) - added by olly 3 weeks ago.
Updated testcase patch for current SVN trunk
bug216-fix.patch (1.9 kB) - added by olly 3 weeks ago.
Patch to fix this issue

Change History

Changed 13 months ago by olly

  • status changed from new to assigned
  • blockedby set to 200

Having a flag certainly sounds reasonable, and would allow us to save ever looking at the termlist after the match.

We don't currently have a "flags" mechanism for Enquire - not sure if this would be best as a "set_flags()" or "set_want_percentages()" or what.

Perhaps in the short term we should see what the overhead is of just always calculating the weight when we need it to get this right?

Marking as blocking 1.0.5 - I don't think we have to fix it for then, but we should at least make a concious decision not to.

Changed 13 months ago by olly

  • blocking set to 200
  • blockedby deleted

Marking as blocking bug#200, not as blocked by it.

Changed 13 months ago by trac

  • platform set to All

Changed 9 months ago by richard

  • description modified (diff)
  • milestone set to 1.1

Changed 9 months ago by richard

  • milestone changed from 1.1.0 to 1.0.7

Changed 9 months ago by richard

  • blocking deleted

Changed 9 months ago by olly

  • owner changed from newbugs to richard
  • status changed from assigned to new

Changed 9 months ago by richard

Patch which adds a method to allow percent calculation to be disabled.

Changed 7 months ago by richard

I don't think this bug can be fixed with the current API without compromising efficiency quite significantly. Therefore, it requires calcpercent.patch to be applied to allow searches to be performed without the percentage calculation code.

However, calcpercent.patch defaults to not calculating percentages. If this was applied to 1.0.7, it would break existing applications which require percentages. Therefore, we would need to make the default be calculation of percentages. If we then applied a fix for the bug, any existing application which didn't turn off percentage calculation would suffer a performance drop.

Since this bug seems to be fairly benign (I found it when reading the code, rather than having a problem in a real application), I suggest simply bumping this issue to 1.1.x, and documenting that percentage results in 1.0.x are unreliable when not sorting by relevance.

Changed 7 months ago by richard

  • status changed from new to assigned

Changed 6 months ago by olly

  • owner changed from richard to olly
  • status changed from assigned to new
  • milestone changed from 1.0.7 to 1.0.8

I think bumping sounds reasonable, but I'd like to take a look myself first, so marking for 1.0.8 for now and claiming this to remind myself I need to look. Also, if we do bump, the issue should be documented as you say.

Changed 6 months ago by olly

  • status changed from new to assigned
  • description modified (diff)

Merge the ReleaseNotes entry from 1.0.7 into the description to try to keep information about this issue in one place.

Changed 4 months ago by olly

  • milestone changed from 1.0.8 to 1.0.9

Still not had a chance to look, bumping to 1.0.9.

Changed 4 months ago by richard

Updated testcase

Changed 3 months ago by olly

  • milestone changed from 1.0.9 to 1.0.10

Bumping to 1.0.10...

Changed 3 weeks ago by olly

Updated testcase patch for current SVN trunk

Changed 3 weeks ago by olly

Patch to fix this issue

Changed 3 weeks ago by olly

  • milestone changed from 1.0.10 to 1.0.11

OK, bug216-fix.patch fixes this issue. However:

  • It means we calculate the weight in potentially many more cases, which could be slow so we probably also want an API method to say that percentages are actually wanted as mentioned above. However, that's not really suitable for 1.0.x - we'd have to default to calculating percentages so as not to break compatibility with existing code, and anyone not updating their code to call the new method would get penalised.
  • There's an issue where we look at the weight of documents which a MatchDecider might reject if offered it. This means that to be pedantically correct, we need to pass more documents to any MatchDecider when sorting (partly) by value, which could incur quite an overhead. Currently my patch ignore this with the result that the percentage weights returned might be lower than they should be in this case.

So the big question of whether we fix this for 1.0.x still remains (the patch is certainly simple enough to consider). We do at least now have a patch to measure the overhead of if we think it might be worth backporting.

But I guess we'll probably have to pass on this for 1.0.10 if I want to release it this weekend though, so bumping the milestone for now at least.

Changed 3 weeks ago by richard

bug216-fix.patch certainly looks reasonable to me. I think it would be reasonable to just document the problem with MatchDeciders?, for now.

My feeling is that it's not really worth worrying about this for 1.0.x, anyway, so we should just bump it to 1.1.0, and add the method to enable percentage calculation at the same time, with a default of not calculating percentages.

Some profiling of how long the extra weight calculation takes would be interesting, but I don't think this issue is important enough to hold up work on 1.0.x (or, indeed, the work trying to get 1.1.0 out).

Note: See TracTickets for help on using tickets.