Opened 16 years ago
Closed 16 years ago
#307 closed enhancement (wontfix)
OP_SYNONYM should allow field weights to be set on its members
Reported by: | Richard Boulton | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | normal | Milestone: | 1.1.1 |
Component: | Matcher | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
Currently, OP_SYNONYM simply adds up the wdfs of the member terms to calculate the wdf of the resulting item. However, it might be useful to allow OP_SYNONYM to be told to modify the wdfs of the member terms (using an integer multiplier). This would correspond functionally to the "weight" function implemented in scriptindex, which allows the weight of a field to be modified at index time by multiplying the wdf, but would allow the weights to be adjusted at search-time without requiring a re-index.
I'm not sure where in the API this would fit, but it's something to consider before merging the OP_SYNONYM branch to trunk.
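A minimal sketch of the arithmetic being proposed, for illustration only. It is not Xapian's API: `synonym_wdf` and its `multipliers` parameter are hypothetical names. Passing no multipliers reproduces the current behaviour (a plain sum of member wdfs).

```python
def synonym_wdf(member_wdfs, multipliers=None):
    """Combine the wdfs of a synonym group's member terms for one document.

    member_wdfs: wdf of each member term in the document.
    multipliers: optional integer multiplier per member term (hypothetical);
                 None gives the plain sum described in this ticket.
    """
    if multipliers is None:
        multipliers = [1] * len(member_wdfs)
    if len(multipliers) != len(member_wdfs):
        raise ValueError("need exactly one multiplier per member term")
    return sum(w * m for w, m in zip(member_wdfs, multipliers))


# A document in which "title:A" has wdf 2 and "body:B" has wdf 3.
plain = synonym_wdf([2, 3])            # 2 + 3 = 5
boosted = synonym_wdf([2, 3], [5, 1])  # 5*2 + 1*3 = 13
```

With multipliers of 5 and 1 the title term dominates the combined wdf, which is the search-time analogue of weighting a field at index time.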
Change History (12)
comment:1 by , 16 years ago
Blocked By: | 50 added |
---|---|
Status: | new → assigned |
comment:2 by , 16 years ago
comment:3 by , 16 years ago
I can't see how you could apply SCALE_WEIGHT to the subqueries, since their weights aren't used: in OP_SYNONYM, the weights are generated from the combined term frequency estimate and the sum of the wdfs - not by combining the weights of subqueries.
You're quite right that we'll need to be careful not to violate the inequalities, though - I hadn't thought of that.
comment:5 by , 16 years ago
Description: | modified (diff) |
---|
comment:6 by , 16 years ago
Bumping milestone to 1.1.1 as this is ready to apply and isn't an incompatible change.
comment:7 by , 16 years ago
Milestone: | 1.1.0 → 1.1.1 |
---|
comment:8 by , 16 years ago
There isn't actually a fix for this one ready to apply, but it's fine to bump this to 1.1.1, anyway.
comment:9 by , 16 years ago
Just to note - doing this could cause a problem, because we're trying to ensure that the wdf is no greater than the document length; currently, we're clamping the wdf to the doclength, but hope that the clamp doesn't actually apply too often - if we had multipliers for the wdf, we'd be likely to end up clamping the resulting values quite often. We'd also like to work out a nicer scheme for calculating the wdf, such that it doesn't exceed the doclength (see ticket #360).
I think this ticket might be best closed as WONTFIX, unless we can come up with some cunning workaround.
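To make the concern concrete, here is a small sketch assuming the combined wdf is clamped with a simple `min()` against the document length. `clamped_synonym_wdf` is a hypothetical helper, not the matcher's actual code.

```python
def clamped_synonym_wdf(member_wdfs, multipliers, doclen):
    """Return (wdf, was_clamped) for a synonym group in one document.

    The raw value is the multiplier-weighted sum of member wdfs; it is then
    clamped so that wdf <= doclen still holds (assumed clamping rule).
    """
    raw = sum(w * m for w, m in zip(member_wdfs, multipliers))
    return min(raw, doclen), raw > doclen


doclen = 10
unscaled = clamped_synonym_wdf([2, 3], [1, 1], doclen)  # raw 5, no clamp
scaled = clamped_synonym_wdf([2, 3], [5, 1], doclen)    # raw 13, clamped to 10
```

Here the unscaled sum fits comfortably within the document length, but a multiplier of 5 on one member is already enough to trigger the clamp, so documents with different raw values would end up with the same wdf.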
comment:10 by , 16 years ago
I think the only sane way for this to work is for the multipliers to be real numbers in the range (0,1] - then you sum the wdfs multiplied by these and round to the nearest integer. This could also perhaps help handle the "term occurs more than once" issue - count up the occurrences of each term up front, then scale the scale factors for each occurrence of a term by 1/number_of_occurrences_of_that_term. Not an ideal solution though, I feel, but perhaps the best there is.
Aside from facilitating that potential fix, I'm not clear what this is useful for ("it might be useful" seems the only justification), so I'm inclined to WONTFIX too, unless/until we have some compelling uses for this.
I don't think this needs to block a merge anyhow.
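One possible reading of the scheme suggested in this comment, as a sketch: factors lie in (0,1], each occurrence of a repeated term has its factor divided by that term's occurrence count, and the sum is rounded to the nearest integer. The function name and data layout are assumptions made for illustration.

```python
from collections import Counter


def fractional_synonym_wdf(members, doc_wdfs):
    """Combined wdf for a synonym group with fractional scale factors.

    members:  list of (term, factor) pairs, 0 < factor <= 1; a term may
              appear more than once in the group.
    doc_wdfs: mapping of term -> wdf in the document being scored.
    """
    # Count occurrences up front so a repeated term is not counted twice over.
    counts = Counter(term for term, _ in members)
    total = 0.0
    for term, factor in members:
        if not 0 < factor <= 1:
            raise ValueError("factor must be in the range (0, 1]")
        total += doc_wdfs.get(term, 0) * factor / counts[term]
    # Python's round() rounds exact halves to the nearest even integer.
    return round(total)


members = [("cat", 1.0), ("cat", 1.0), ("feline", 0.5)]
combined = fractional_synonym_wdf(members, {"cat": 4, "feline": 2})
```

In this example "cat" appears twice in the group but contributes its wdf of 4 only once in total (2 + 2), and "feline" contributes 1, giving 5 rather than the 9 that naive summation of every occurrence would produce.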
comment:11 by , 16 years ago
Blocked By: | 50 removed |
---|
That potential fix would be done internally, though - the thinking behind this ticket was really to provide an API to allow OP_SYNONYM to more closely follow what would happen if you performed the synonym expansion at index time in a system with wdf multipliers for each field (as in omega), but allowing the relative weighting of fields to be adjusted at search time instead of index time. I don't have a concrete example of wanting to use it in this way, though - it was just a feeling that it would be nice to be "feature complete" here.
I think if the multipliers are in the range (0,1] this wouldn't work well; if you wanted term "title:A" to have 5 times the wdf of term "body:B", you'd implement that by dividing the wdfs of "body:B" by 5 and you'd lose any distinction between documents with fairly different values (eg, 1 and 5) for the wdf of "body:B".
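A quick sketch of that loss of resolution, assuming the scaled wdf is rounded to the nearest integer (which values collide depends on the rounding rule chosen; `scaled_wdf` is a hypothetical helper):

```python
def scaled_wdf(wdf, divisor):
    """Scale a wdf down by an integer divisor, rounding to the nearest integer."""
    return round(wdf / divisor)


# wdf of "body:B" in six different documents, scaled by 1/5.
body_wdfs = [1, 2, 3, 5, 7, 8]
scaled = [scaled_wdf(w, 5) for w in body_wdfs]
# scaled == [0, 0, 1, 1, 1, 2]
```

Under this rule, documents with wdfs of 3, 5 and 7 all end up with a scaled wdf of 1, and those with 1 or 2 drop to 0, so most of the original ordering information within the field is discarded.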
An alternative approach to making one field more important is to use an OR query instead of a synonym query. The weightings which come out of doing that are different, of course (that's the point of OP_SYNONYM), but you could use an OP_SYNONYM as the main query, and an OR query with a fairly low multiplier to give a slight boost to a particular field, if you really needed to adjust the weight in this sort of way.
I think we're both agreed that WONTFIX is an appropriate response, given the difficulty of coming up with a reasonable implementation, and given the lack of a definite use for this. Closing as that.
comment:12 by , 16 years ago
Resolution: | → wontfix |
---|---|
Status: | assigned → closed |
Is OP_SYNONYM with OP_SCALE_WEIGHT applied to the subqueries usefully close to this?
If you're going to allow scaling of wdf values like this, then you probably need to adjust the document length correspondingly, which seems to be an ugly direction. Otherwise you'll violate inequalities I think we currently rely on for optimising, such as:
wdf(term,doc) <= doclen(doc)