Opened 16 years ago

Closed 16 years ago

#307 closed enhancement (wontfix)

OP_SYNONYM should allow field weights to be set on its members

Reported by: Richard Boulton
Owned by: Richard Boulton
Priority: normal
Milestone: 1.1.1
Component: Matcher
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Currently, OP_SYNONYM simply adds up the wdfs of the member terms to calculate the wdf of the resulting item. However, it might be useful to allow OP_SYNONYM to be told to modify the wdfs of the member terms (using an integer multiplier). This would correspond functionally to the "weight" function implemented in scriptindex, which allows the weight of a field to be modified at index time by multiplying the wdf, but would allow the weights to be adjusted at search-time without requiring a re-index.
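
As a rough illustration of the proposal, here is a toy model in plain Python. It is not Xapian's implementation: the function name `synonym_wdf`, the dict layout and the example terms are all assumptions made for this sketch.

```python
def synonym_wdf(member_wdfs, multipliers=None):
    """Combine the wdfs of a synonym's member terms for one document.

    member_wdfs: dict mapping term -> wdf of that term in the document.
    multipliers: optional dict mapping term -> integer multiplier
                 (hypothetical parameter; terms not listed default to 1).
    """
    multipliers = multipliers or {}
    return sum(wdf * multipliers.get(term, 1)
               for term, wdf in member_wdfs.items())


# One document in which the word occurs once in the title and four
# times in the body (made-up numbers).
doc = {"title:apple": 1, "body:apple": 4}

plain = synonym_wdf(doc)                         # current behaviour: 1 + 4
boosted = synonym_wdf(doc, {"title:apple": 5})   # proposed: 1*5 + 4
```

With no multipliers the sketch reproduces the plain sum described above; the multiplier only changes how much each member contributes to the combined wdf.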

I'm not sure where in the API this would fit, but it's something to consider before merging the OP_SYNONYM branch to trunk.

Change History (12)

comment:1 by Richard Boulton, 16 years ago

Blocked By: 50 added
Status: new → assigned

comment:2 by Olly Betts, 16 years ago

Is OP_SYNONYM with OP_SCALE_WEIGHT applied to the subqueries usefully close to this?

If you're going to allow scaling of wdf values like this, then you probably need to adjust the document length correspondingly, which seems to be an ugly direction. Otherwise you'll violate inequalities I think we currently rely on for optimising, such as: wdf(term,doc) <= doclen(doc)

comment:3 by Richard Boulton, 16 years ago

I can't see how you could apply SCALE_WEIGHT to the subqueries, since their weights aren't used: in OP_SYNONYM, the weights are generated from the combined term frequency estimate and the sum of the wdfs - not by combining the weights of subqueries.

You're quite right that we'll need to be careful not to violate the inequalities, though - I hadn't thought of that.

comment:4 by Olly Betts, 16 years ago

Ah yes, OP_SCALE_WEIGHT doesn't help here...

comment:5 by Olly Betts, 16 years ago

Description: modified (diff)

comment:6 by Olly Betts, 16 years ago

Bumping milestone to 1.1.1 as this is ready to apply and isn't an incompatible change.

comment:7 by Olly Betts, 16 years ago

Milestone: 1.1.0 → 1.1.1

comment:8 by Richard Boulton, 16 years ago

There isn't actually a fix for this one ready to apply, but it's fine to bump this to 1.1.1, anyway.

comment:9 by Richard Boulton, 16 years ago

Just to note - doing this could cause a problem, because we're trying to ensure that the wdf is no greater than the document length; currently, we're clamping the wdf to the doclength, but hope that the clamp doesn't actually apply too often - if we had multipliers for the wdf, we'd be likely to end up clamping the resulting values quite often. We'd also like to work out a nicer scheme for calculating the wdf, such that it doesn't exceed the doclength (see ticket #360).
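
A small sketch of why multipliers would make the clamp bite more often. This is a toy model with made-up numbers, not the matcher's code; `clamped_synonym_wdf` and its arguments are hypothetical names.

```python
def clamped_synonym_wdf(member_wdfs, doclen, multipliers=None):
    """Return (wdf, was_clamped) for one document.

    The combined wdf is the (optionally multiplied) sum of the member
    wdfs, clamped so that it never exceeds the document length.
    """
    multipliers = multipliers or {}
    raw = sum(wdf * multipliers.get(term, 1)
              for term, wdf in member_wdfs.items())
    return min(raw, doclen), raw > doclen


doc = {"title:apple": 2, "body:apple": 3}
doclen = 8  # assumed document length, including other terms

unscaled = clamped_synonym_wdf(doc, doclen)
scaled = clamped_synonym_wdf(doc, doclen, {"title:apple": 5})
```

Here the unscaled sum (5) fits within the document length, while the scaled sum (13) has to be clamped to 8, so the multiplier's effect is partly thrown away.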

I think this ticket might be best closed as WONTFIX, unless we can come up with some cunning workaround.

comment:10 by Olly Betts, 16 years ago

I think the only sane way for this to work is for the multipliers to be real numbers in the range (0,1] - then you sum the wdfs multiplied by these and round to the nearest integer. This could also perhaps help handle the "term occurs more than once" issue - count the occurrences of each term up front, then scale the factor for each occurrence of a term by 1/number_of_occurrences_of_that_term. Not an ideal solution though, I feel, but perhaps the best there is.
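
One possible reading of that scheme, sketched in plain Python. The names are hypothetical, and "nearest integer" is taken to mean ordinary rounding, which is an assumption (note that Python's `round` rounds exact halves to the even neighbour).

```python
from collections import Counter


def fractional_synonym_wdf(members, doc_wdfs):
    """Combined wdf with per-member factors in (0, 1].

    members:  list of (term, factor) pairs; a term may appear more than once.
    doc_wdfs: dict mapping term -> wdf of that term in the document.
    """
    for _, factor in members:
        if not 0 < factor <= 1:
            raise ValueError("factor must be in the range (0, 1]")
    # Count how often each term appears among the members, so that a
    # repeated term is not counted multiple times in the sum.
    occurrences = Counter(term for term, _ in members)
    total = sum(doc_wdfs.get(term, 0) * factor / occurrences[term]
                for term, factor in members)
    return round(total)


doc_wdfs = {"apple": 4, "pear": 4}
members = [("apple", 1.0), ("apple", 1.0), ("pear", 0.5)]
combined = fractional_synonym_wdf(members, doc_wdfs)  # 2 + 2 + 2
```

Because every factor is at most 1 and repeated terms share their factor, the result in this model cannot exceed the sum of the distinct terms' wdfs, so it stays within the document length without clamping.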

Aside from facilitating that potential fix, I'm not clear what this is useful for ("it might be useful" seems the only justification), so I'm inclined to WONTFIX too, unless/until we have some compelling uses for this.

I don't think this needs to block a merge anyhow.

comment:11 by Richard Boulton, 16 years ago

Blocked By: 50 removed

That potential fix would be done internally, though - the thinking behind this ticket was really to provide an API to allow OP_SYNONYM to more closely follow what would happen if you performed the synonym expansion at index time in a system with wdf multipliers for each field (as in omega), but allowing the relative weighting of fields to be adjusted at search time instead of index time. I don't have a concrete example of wanting to use it in this way, though - it was just a feeling that it would be nice to be "feature complete" here.

I think if the multipliers are in the range (0,1] this wouldn't work well; if you wanted term "title:A" to have 5 times the wdf of term "body:B", you'd implement that by dividing the wdfs of "body:B" by 5 and you'd lose any distinction between documents with fairly different values (eg, 1 and 5) for the wdf of "body:B".
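
To make the loss of resolution concrete, a quick sketch in plain Python. How the fractional value gets turned back into an integer is an assumption here, so both rounding down and rounding up are shown.

```python
import math

divisor = 5                  # "title:A" should count 5 times as much as "body:B"
body_wdfs = [1, 2, 3, 4, 5]  # wdf of "body:B" in five different documents

# Scale the body wdfs down instead of scaling the title wdfs up.
rounded_down = [math.floor(wdf / divisor) for wdf in body_wdfs]
rounded_up = [math.ceil(wdf / divisor) for wdf in body_wdfs]

# The alternative with integer multipliers leaves the body wdfs untouched,
# so all five documents remain distinguishable.
untouched = list(body_wdfs)
```

Rounding down gives `[0, 0, 0, 0, 1]` and rounding up gives `[1, 1, 1, 1, 1]`, so most or all of the differences between these documents disappear, whereas the untouched values keep five distinct levels.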

An alternative approach to making one field more important is to use an OR query instead of a synonym query. The weightings which come out of doing that are different, of course (that's the point of OP_SYNONYM), but you could use an OP_SYNONYM as the main query, and an OR query with a fairly low multiplier to give a slight boost to a particular field, if you really needed to adjust the weight in this sort of way.

I think we're both agreed that WONTFIX is an appropriate response, given the difficulty of coming up with a reasonable implementation, and given the lack of a definite use for this. Closing as that.

comment:12 by Richard Boulton, 16 years ago

Resolution: wontfix
Status: assigned → closed