Opened 16 years ago
Closed 11 years ago
#360 closed defect (fixed)
SynonymPostList always requires doclength if wdf is used
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.18 |
Component: | Matcher | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
SynonymPostList currently clamps computed wdf values to the document length. This is to ensure that the wdf does not exceed the document length, which is a condition that some weight schemes can rely on for computing tight bounds on the maximum weight.
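As a rough illustration of the clamping described above (a minimal sketch only, not the actual SynonymPostList code; the `termcount` alias and the function name `clamped_synonym_wdf` are hypothetical):

```cpp
#include <algorithm>
#include <vector>

// Assumption: a plain unsigned count stands in for the real term count type.
using termcount = unsigned;

// Hypothetical helper: sum the wdf of every sub-postlist matching the
// current document, then clamp the sum so it never exceeds the doclength.
termcount clamped_synonym_wdf(const std::vector<termcount>& sub_wdfs,
                              termcount doclength) {
    termcount total = 0;
    for (termcount wdf : sub_wdfs) {
        total += wdf;
    }
    // The clamp is what forces the doclength to be fetched.
    return std::min(total, doclength);
}
```

The point of the sketch is that the clamp needs `doclength` as an input, even when the weighting scheme itself only wants the wdf.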
It would be good to avoid fetching the doclength for weighting schemes which need the wdf but not the doclength. One approach would be to ensure that the wdf sum used in OP_SYNONYM only counts each physical term once. However, it is hard to do this duplicate removal in advance, because query tree decay may remove some instances of a term while leaving others.
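The "count each physical term once" idea might look something like this (a sketch under assumptions: the leaves are given as term/wdf pairs, and `deduplicated_wdf_sum` is a hypothetical name, not a Xapian function):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using termcount = unsigned;  // assumption: stand-in for the real count type

// Hypothetical helper: each leaf is (term, wdf of that term in the current
// document). A term appearing in several leaves contributes its wdf once.
termcount deduplicated_wdf_sum(
    const std::vector<std::pair<std::string, termcount>>& leaves) {
    std::map<std::string, termcount> seen;
    for (const auto& leaf : leaves) {
        // emplace() leaves an existing entry untouched, so repeats are ignored.
        seen.emplace(leaf.first, leaf.second);
    }
    termcount total = 0;
    for (const auto& entry : seen) {
        total += entry.second;
    }
    return total;
}
```

This only shows the arithmetic. It does not address the difficulty raised above, namely that the set of live leaves can change as the query tree decays.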
Change History (5)
comment:1 by , 16 years ago
Milestone: | 1.1.7 → 1.2.0 |
---|
comment:2 by , 13 years ago
Description: | modified (diff) |
---|---|
Milestone: | 1.2.x → 1.3.x |
I think this is probably difficult to fix as stated, and the contortions which would be needed are probably not worth the effort.
But we could add an OP_MAX operator which acts like OP_OR but returns the greatest weight of any subquery instead of summing them. This would act in a fairly similar way to OP_SYNONYM, but wouldn't suffer from the issue here.
I suggested OP_MAX previously without thinking about this issue, and we concluded it was probably useful to have.
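To make the difference concrete, here is a minimal sketch of the two ways of combining subquery weights for one document (the function names are hypothetical, and this is not how the matcher is structured internally):

```cpp
#include <algorithm>
#include <vector>

// OP_OR-style combination: the weights of the matching subqueries are summed.
double or_style_weight(const std::vector<double>& subquery_weights) {
    double total = 0.0;
    for (double w : subquery_weights) {
        total += w;
    }
    return total;
}

// OP_MAX-style combination: only the greatest subquery weight is kept.
// Returns 0.0 when no subquery matched.
double max_style_weight(const std::vector<double>& subquery_weights) {
    if (subquery_weights.empty()) {
        return 0.0;
    }
    return *std::max_element(subquery_weights.begin(), subquery_weights.end());
}
```

Neither combination needs a merged wdf, which is why the max-based form would sidestep the doclength clamp.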
comment:4 by , 11 years ago
Milestone: | 1.3.3 → 1.2.18 |
---|---|
Status: | new → assigned |
OK, so I've implemented OP_MAX, but in my tests with the etext db and all the terms starting "th" it is actually slower than OP_SYNONYM (at least under BM25), so that's not a great fix. OP_SYNONYM is faster than OP_OR in my tests, I think because the weight calculation doesn't require recursing into all the subpostlists.
We can skip fetching the doclength if the wdf we calculated is <= doclength_lower_bound for the current subdatabase. That's a cheap check which should help, so I've implemented it in r17883. The other thing I can see that we can do relatively easily is handle the common case where OP_SYNONYM has only terms as subqueries and they're all different: I think in that case the estimated synonym wdf can't exceed the doclength.
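The doclength_lower_bound shortcut could be sketched roughly as follows (an illustration only, not the code from r17883; `synonym_wdf` and the `fetch_doclength` callback are hypothetical names standing in for the real doclength lookup):

```cpp
#include <algorithm>
#include <functional>

using termcount = unsigned;  // assumption: stand-in for the real count type

// Hypothetical helper: return the synonym wdf, fetching the document length
// only when the summed wdf might actually exceed it.
termcount synonym_wdf(termcount summed_wdf,
                      termcount doclength_lower_bound,
                      const std::function<termcount()>& fetch_doclength) {
    // Every document is at least doclength_lower_bound long, so a sum at or
    // below that bound cannot exceed the real doclength: no clamp needed.
    if (summed_wdf <= doclength_lower_bound) {
        return summed_wdf;
    }
    // Otherwise fall back to the expensive path and clamp as before.
    return std::min(summed_wdf, fetch_doclength());
}
```

The callback is only invoked on the slow path, so documents whose summed wdf is small never pay for the doclength lookup.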
I've also committed OP_MAX (since I implemented it) in r17884.
We should backport the doclength_lower_bound optimisation for 1.2.18 if it applies reasonably cleanly, so updating milestone to remind us to do that.
comment:5 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Backported the OP_SYNONYM optimisation in r17910.
BM25 with the default settings uses the document length, so this wouldn't change anything for most users; bumping.