Opened 3 years ago

Closed 7 months ago

#810 closed defect (fixed)

LMWeight implementation issues

Reported by: Olly Betts Owned by: Olly Betts
Priority: highest Milestone: 1.5.0
Component: Library API Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

I've just been implementing an improvement to always use the per-shard document length bounds in weight calculations (previously we used the global bounds for local shards, but the remote-specific bounds for remote shards).

As part of this, I checked how the different weighting schemes use these bounds in case any expected them to be the same for every shard - and LMWeight does. That was already wrong with remote shards, but with my change becomes wrong more often.

We could make the global bounds available to address this, but looking more closely, I'm dubious about the implementation in more ways than just this.

The weight score is a product of probabilities over terms in the query, which we convert to a sum (as required by our weighting machinery) by taking log. So we calculate log(weight_document) which is log(product(probability_term)) for terms in the query which is sum(log(probability_term)).

There are two problems here.

One is that I'm fairly sure we're meant to sum over all terms in the query, not just those that match the current document. Not doing this is problematic because it means a document which matches fewer terms actually gets a higher score (because probabilities are in the range [0, 1]).

The other is that the log of a probability is always <= 0 and the term weight contributions have to be >= 0 in Xapian.

To get around this we currently actually multiply each term's probability by log_param before taking the log (which is equivalent to adding a constant after the log). The problem here is that this effectively adds a constant weight boost per matching term. That at least tends to counteract the first issue, but overall we really aren't implementing the weighting scheme we claim to be.

I notice that we could scale all document scores (in the original product form) by dividing by the product over all query terms of the probability of the term in the collection then for each document we can cancel the contributions for terms not matching that document and after taking the log we end up summing log(probability_document/probability_collection) over terms matching the document.

I'd think probability_document (which unsmoothed is wdf/doclen and smoothed will be somewhere between that and probability_collection) is usually greater than probability_collection (collection_freq/total_doc_length), so the result of this log will usually be positive too. If it isn't, then clamping to zero seems reasonable (as if this happens the term occurs in the document but less often than you'd expect by random chance).

One thing we really need to do is track down exactly which paper(s) the formulae used are from - a code comment only rather vaguely says "as described by the early papers on LM by Bruce Croft". I think it must be one which came after "A General Language Model for Information Retrieval" (Song and Croft 1999) - that does have a matching basic framework but doesn't describe all the smoothing options our implementation has, and it also relies on modelling the term distribution (which we don't do).

Marking with 1.5.0, though I don't see this as a blocker - this isn't the default weighting scheme and I think all of the issues noted above exist in the original implementation from 2014.

Change History (6)

comment:1 by Olly Betts, 3 years ago

Description: modified (diff)

comment:2 by Olly Betts, 3 years ago

https://github.com/xapian/xapian/pull/230 may be related (and should be resolved even if it isn't).

comment:3 by Olly Betts, 14 months ago

Marking with 1.5.0, though I don't see this as a blocker - this isn't the default weighting scheme and I think all of the issues noted above exist in the original implementation from 2014.

If we don't address this for 1.5.0 maybe we should remove LMWeight until this is resolved. We know the current implementation is wrong so it doesn't seem helpful to keep providing it.

comment:4 by Olly Betts, 14 months ago

Priority: normalhighest

comment:5 by Olly Betts, 8 months ago

Status: newassigned

Rechecking for the current code (and my WIP fix for the LMWeight incorrect implementation) these stats are for the current shard and made available to the term-independent weight contribution weight:

  • DOC_LENGTH_MAX
  • DOC_LENGTH_MIN

These weighting schemes include a term-independent weight:

  • BM25PlusWeight - safe
  • BM25Weight - safe
  • LMWeight - corrected versions of Dirichlet & Dir+, Absolute Discount, Two Stage (in WIP) use DOC_LENGTH_MAX in get_sumextra() (via cached value computed in init()
Last edited 8 months ago by Olly Betts (previous) (diff)

comment:6 by Olly Betts, 7 months ago

Resolution: fixed
Status: assignedclosed

I've implemented bounds for the whole DB in 900f0af7892343070a4b0a1258e84cf279ae9efc (via DB_DOC_LENGTH_MIN, etc). This is not suitable for backporting to 1.4.x.

The incorrect equations used by LMWeight is covered by #779 so closing.

Note: See TracTickets for help on using tickets.