Inconsistent return values for percentage weights
Reported by: Richard Boulton | Owned by: Olly Betts
Description
When results are sorted primarily by something other than relevance (e.g. with
Enquire::set_sort_by_value()), the percentage values returned by the MSet object
may be incorrect. They are calculated relative to the highest-weighted document
in the requested portion of the MSet, instead of the highest-weighted document
matching the query.
This issue has existed in all previous Xapian releases, as far as we can tell.
There is currently no fix in progress, since it is probably not possible to fix without significant loss of efficiency, which would adversely affect users who aren't interested in the percentage scores.
If you really need percentage scores in this situation, one workaround would be to first run the search using relevance order, asking for only the top document, and to remember the weight and percentage assigned to that document. Then, re-run the search in sorted order, and calculate the percentages yourself from the weights assigned to the results, using this information.
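The arithmetic for the second step of that workaround could look something like the sketch below. The helper name `percent_from_weight` is hypothetical and not part of the Xapian API; `top_weight` and `top_percent` are assumed to be the values remembered from the first, relevance-ordered search (for instance as read via MSetIterator::get_weight() and MSetIterator::get_percent()). Rounding to the nearest integer is also an assumption, so results may differ by one from what Xapian itself would report.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the Xapian API.
// Scales a document's weight to a percentage, relative to the weight and
// percentage of the top document from a separate relevance-ordered search.
int percent_from_weight(double weight, double top_weight, int top_percent) {
    assert(top_percent >= 0 && top_percent <= 100);
    if (top_weight <= 0.0) {
        // No meaningful reference weight, so no meaningful percentage.
        return 0;
    }
    double scaled = weight / top_weight * top_percent;
    long rounded = std::lround(scaled);  // assumption: round to nearest
    // Clamp defensively in case of a weight above the remembered maximum.
    return static_cast<int>(std::max(0L, std::min(100L, rounded)));
}
```

You would call this once per result from the sorted search, passing that result's weight together with the two remembered values.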
A testcase demonstrating this is attached to this ticket.
The issue is that in multimatch.cc, we calculate "best" by looking for the highest weighted document in the candidate mset, but when sorting by anything other than relevance, the highest weighted document may have been discarded already.
It is hard to see how to fix this. One obvious approach would be to check every candidate document's weight before discarding it during the match process, and keep track of the docid of the document with the highest weight seen so far. However, we currently don't calculate the weight for every document we see (because we first compare the document against the lowest document in the mset using mcmp), so this would force us to calculate weights for documents that would otherwise not need them. Since the percentages aren't necessarily even wanted, this seems a shame.
Perhaps a reasonable approach would be to add a flag on enquire which governed whether percentages were wanted or not; it would then be more reasonable to go to extra effort to keep track of the highest weighted document if the percentages were actually desired.
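To make the idea concrete, here is a toy sketch of tracking the best weight across all candidates, gated by a flag. It is not the matcher code from multimatch.cc: `Candidate`, `MatchResult`, `collect` and `want_percentages` are all hypothetical names, and the weights are precomputed here, whereas in the real matcher computing the weight is exactly the cost the flag is meant to avoid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical candidate record; not a Xapian type.
struct Candidate {
    unsigned docid;
    int sort_key;   // primary ordering (higher sorts first in this sketch)
    double weight;  // relevance weight
};

struct MatchResult {
    std::vector<Candidate> items;  // the top max_items candidates by sort_key
    double best_weight = 0.0;      // highest weight found
    unsigned best_docid = 0;       // document that had best_weight
};

MatchResult collect(const std::vector<Candidate>& stream,
                    std::size_t max_items, bool want_percentages) {
    MatchResult r;
    for (const Candidate& c : stream) {
        // With the flag set, look at every candidate's weight before it
        // can be discarded.
        if (want_percentages && c.weight > r.best_weight) {
            r.best_weight = c.weight;
            r.best_docid = c.docid;
        }
        r.items.push_back(c);
        std::stable_sort(r.items.begin(), r.items.end(),
                         [](const Candidate& a, const Candidate& b) {
                             return a.sort_key > b.sort_key;
                         });
        if (r.items.size() > max_items) {
            r.items.pop_back();  // discard the lowest-sorted candidate
        }
        assert(r.items.size() <= max_items);
    }
    if (!want_percentages) {
        // Without the flag, "best" is taken from the retained items only,
        // which is the behaviour this ticket describes.
        for (const Candidate& c : r.items) {
            if (c.weight > r.best_weight) {
                r.best_weight = c.weight;
                r.best_docid = c.docid;
            }
        }
    }
    return r;
}
```

With a stream in which the highest-weighted document sorts last, the flagged run still reports that document as best, while the unflagged run reports the best of the retained items only.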