Opened 16 years ago

Closed 11 years ago

#360 closed defect (fixed)

SynonymPostList always requires doclength if wdf is used

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.18
Component: Matcher
Version: SVN trunk
Severity: minor
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

SynonymPostList currently clamps computed wdf values to the document length. This is to ensure that the wdf does not exceed the document length, which is a condition that some weight schemes can rely on for computing tight bounds on the maximum weight.

It would be good to avoid calculating the doclength for weighting schemes which require the wdf but not the doclength. One approach would be to ensure that the wdf sum used in OP_SYNONYM only counts each physical term once. However, it is hard to do this duplicate removal in advance, because query tree decay may remove some instances of a term while leaving others.
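To illustrate why duplicates matter, here is a minimal sketch, not Xapian's actual code. It assumes that the document length is the sum of the wdfs of all terms in the document, and it models a document as a hypothetical `TermWdfs` map. All function names are made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a single document: term -> wdf.
using TermWdfs = std::map<std::string, std::uint32_t>;

// Look up the wdf of a term, treating absent terms as wdf 0.
static std::uint32_t wdf_of(const TermWdfs& doc, const std::string& term) {
    auto it = doc.find(term);
    return it == doc.end() ? 0u : it->second;
}

// Sum the wdf over every leaf of the synonym, so a term which appears
// twice in the query is counted twice. This sum can exceed the doclength.
std::uint32_t naive_synonym_wdf(const std::vector<std::string>& leaves,
                                const TermWdfs& doc) {
    std::uint32_t sum = 0;
    for (const auto& term : leaves) sum += wdf_of(doc, term);
    return sum;
}

// Count each physical term only once. Under the assumption above, this
// sum cannot exceed the doclength.
std::uint32_t deduped_synonym_wdf(const std::vector<std::string>& leaves,
                                  const TermWdfs& doc) {
    std::set<std::string> seen;
    std::uint32_t sum = 0;
    for (const auto& term : leaves) {
        if (seen.insert(term).second) sum += wdf_of(doc, term);
    }
    return sum;
}

// The clamping approach: needs the doclength for every document.
std::uint32_t clamped_synonym_wdf(const std::vector<std::string>& leaves,
                                  const TermWdfs& doc,
                                  std::uint32_t doclength) {
    return std::min(naive_synonym_wdf(leaves, doc), doclength);
}
```

For a document containing "the" with wdf 3 and "then" with wdf 2 (doclength 5), a synonym over the leaves "the", "then", "the" gives a naive sum of 8, which the clamp reduces to 5. The deduplicated sum is 5 without needing the doclength at all.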

Change History (5)

comment:1 by Olly Betts, 16 years ago

Milestone: 1.1.7 → 1.2.0

BM25 with the default settings uses the document length, so fixing this wouldn't change anything for most users. Bumping the milestone.

comment:2 by Olly Betts, 13 years ago

Description: modified (diff)
Milestone: 1.2.x → 1.3.x

I think this is probably difficult to fix as stated, and the contortions which would be needed are probably not worth the effort.

But we could add an OP_MAX operator which acts like OP_OR but returns the greatest weight of any subquery instead of summing them. This would act in a fairly similar way to OP_SYNONYM, but wouldn't suffer from the issue here.

I suggested OP_MAX previously without thinking about this issue, and we concluded it was probably useful to have.
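The difference in how the two operators would combine weights can be sketched as follows. This is a simplified model, not the matcher's implementation: it assumes the weights of the subqueries which match a given document are already available in a vector, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// OP_OR-style combination: the document's weight is the sum of the
// weights of all matching subqueries.
double or_weight(const std::vector<double>& matching_weights) {
    return std::accumulate(matching_weights.begin(),
                           matching_weights.end(), 0.0);
}

// OP_MAX-style combination: the document's weight is the greatest
// weight of any matching subquery (0 if nothing matches).
double max_weight(const std::vector<double>& matching_weights) {
    if (matching_weights.empty()) return 0.0;
    return *std::max_element(matching_weights.begin(),
                             matching_weights.end());
}
```

Because each subquery is weighted on its own, no combined wdf is ever computed, so the question of that wdf exceeding the doclength does not arise.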

comment:3 by Olly Betts, 11 years ago

Milestone: 1.3.x → 1.3.3

Let's at least implement OP_MAX soon.

comment:4 by Olly Betts, 11 years ago

Milestone: 1.3.3 → 1.2.18
Status: new → assigned

OK, so I've implemented OP_MAX, but in my tests with the etext db and all the terms starting "th", it is actually slower than OP_SYNONYM (at least under BM25), so that's not a great fix. OP_SYNONYM is faster than OP_OR in my tests, I think because the weight calculation doesn't require recursing into all the subpostlists.

We can skip fetching the doclength if the wdf we calculated <= doclength_lower_bound for the current subdatabase, and that's a cheap check which should help, so I've implemented that in r17883. The other thing I can see that we can do relatively easily is handling the common case where OP_SYNONYM has only terms as subqueries and they're all different - I think in that case the estimated synonym wdf can't exceed the doclength.
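The cheap check described above might look roughly like this. It is a sketch under stated assumptions rather than the code committed in r17883: `clamp_wdf` and the `fetch_doclength` callback are hypothetical names, with the callback standing in for the comparatively expensive per-document length lookup.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Clamp the computed synonym wdf to the document length, but only fetch
// the document length when the cheap lower-bound check can't rule out
// the need for clamping.
std::uint32_t clamp_wdf(std::uint32_t wdf,
                        std::uint32_t doclength_lower_bound,
                        const std::function<std::uint32_t()>& fetch_doclength) {
    // Every document in this subdatabase is at least this long, so a wdf
    // no larger than the bound cannot exceed the actual doclength.
    if (wdf <= doclength_lower_bound) return wdf;

    // Otherwise we have to pay for the lookup and clamp as before.
    return std::min(wdf, fetch_doclength());
}
```

With a lower bound of 5, a computed wdf of 3 is returned without calling `fetch_doclength` at all, while a wdf of 12 triggers one lookup and is clamped if the document turns out to be shorter.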

I've also committed OP_MAX (since I implemented it) in r17884.

We should backport the doclength_lower_bound optimisation for 1.2.18 if it applies reasonably cleanly, so updating milestone to remind us to do that.

Last edited 11 years ago by Olly Betts

comment:5 by Olly Betts, 11 years ago

Resolution: fixed
Status: assigned → closed

Backported the OP_SYNONYM optimisation in r17910.