#203 closed defect (released)
Add query operator to multiply weights by a factor
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Matcher | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
Currently, the only way to manually modify the weights of particular terms is to adjust the wdf at the time they are indexed. This is useful, but limited. I would like to be able to modify the weights of particular terms at query construction time. I believe the easiest way to do this is to add a query operator which takes a parameter (as a double) and a single sub-query, and multiplies all the weights of the subquery by the parameter. This would allow arbitrary weight modifications at search time, by applying the operator to parts of the query. This could also be very useful for integrating "expanded" query terms with a lower importance than "user-entered" query terms, indicating differing levels of confidence in the importance of terms.
This may be a little controversial as a new feature, because I know of no theoretical model to explain why multiplying the weights of part of a query by a factor should be a useful thing to do. However, I've come across several places where some kind of search-time weight adjustment would be extremely useful, this is the simplest way I can think of to adjust weights, and my initial experiments with this have produced plausible-looking results.
I have produced an initial implementation of this feature, which I will attach to this bug. Ideally, I'd like to apply it to head shortly after we've released 1.0.3, so I'll mark the bug as blocking 1.0.4, for now.
Attachments (2)
Change History (15)
comment:1 by , 17 years ago
Blocking: | 200 added |
---|---|
Status: | new → assigned |
comment:2 by , 17 years ago
comment:3 by , 17 years ago
Blocked By: | 164 added |
---|
I had some comments on this, but didn't have time to write them before going away.
I think the name could be improved - it would be better to use "MULTIPLY" than a somewhat awkward abbreviation like "MULT", but I don't think multiplication is really the essence of this operation anyway, and something like "OP_SCALE_WEIGHT" seems better still.
The test against DBL_EPSILON to decide if we're boolean seems incorrect - I think just (dbl_parameter <= 0.0) is what is needed there. Actually, perhaps a negative scaling factor should result in InvalidArgumentError rather than being silently clipped to 0?
I think we should probably add this before OP_ELITE_SET so that we no longer have a "gappy" enum (unless there's an issue with doing so which I'm missing). The gap is due to OP_WEIGHT_CUTOFF (or whatever it was called) being removed, and we've definitely broken ABI compatibility since then.
Also, AssertNe(multiplier, 0) should probably just be written as Assert(multiplier != 0) - the assertion message only happens when x == 0, so using AssertNe() just means more code is generated for producing the assertion failure message without actually improving it at all.
comment:4 by , 17 years ago
I like the name OP_SCALE_WEIGHT much better - I've put together a patch to change it to that. For consistency, I've also renamed MultWeightPostList (and the corresponding source files).
I've also moved the operator to be before OP_ELITE_SET - I hadn't remembered why there was a gap there, so I just added to the end of the enum. It is indeed better to fill the gap.
Your comment about assertions makes sense, I'll bear that in mind in future (and I've changed it).
You're probably right about the DBL_EPSILON test, too: my thinking behind testing against DBL_EPSILON to decide if we're boolean was that handling a query as boolean, instead of multiplying its weight by a very small value, is just an optimisation. However, the ordering of documents whose weights are multiplied by DBL_EPSILON is is likely to be different from boolean sort order, so this was probably a mistake.
If we're not using the DBL_EPSILON test, it might well make sense to raise an exception for a negative value. My motivation for clipping negative values to zero was that, if we're doing a DBL_EPSILON test, it would be wrong to raise an exception for a negative value which is greater than -DBL_EPSILON. It seemed wrong to have an exception raised for larger negative values, if small negative values were permitted. This reasoning may not be that sound anyway, but if we're not going to do an epsilon test anyway, it's all moot.
So, I'm in favour of removing the DBL_EPSILON test against 0, and raising an exception if dbl_parameter < 0.
I still think we should keep the epsilon test in api/omqueryinternal.cc in simplify_query() comparing dbl_parameter to 1.0 - this _is_ just an optimisation, and if the value is very slightly different from 1.0 the effect on rankings will be negligible.
comment:5 by , 17 years ago
Not sure why this bug depends on #164 - Olly: can you comment? I'm wondering if making the dependency was a slip...
comment:6 by , 17 years ago
Blocked By: | 164 removed |
---|
Did I add the dependency on 164? Bugzilla doesn't seem to say, but if so it must have been finger trouble. Removed it anyway, as it's clearly bogus.
The DBL_EPSILON test in simplify_query() seems fine to me - that's the sort of use which the epsilon constants are intended for.
It would be good to ensure there's a decomposing test for a fraction which isn't exactly representable in base 2, and which when multiplied by its reciprocal doesn't give 1.0 exactly in IEEE double representation.
comment:7 by , 17 years ago
I was about to ask why this bug depended on #164 (it was set to do so by Olly a few hours ago)...
I've written and committed some tests for searches with parts of the query tree multiplied by 0 (and by negative numbers), and have written (but not yet tidied and committed) a few more tests of queries with different structures.
Therefore, the only significant outstanding feature needed before this bug can be closed is support for decomposing nested OP_SCALE_WEIGHT operators (this probably isn't desperately important, but I'd like to pull common multipliers out of OR or AND groups, so that the normal optimisations can work within those groups).
The network protocol should also be bumped (but just a minor bump) because an old server won't understand a query with the OP_SCALE_WEIGHT operator in it.
comment:8 by , 17 years ago
I think it's better to leave further optimisations of this for now. I'm looking at doing some reworking for bug#164 which may make this easier to do...
comment:9 by , 17 years ago
Oh - I spent a little time yesterday while my network connection was down implementing some optimisations for pulling scale-weight nodes up. I didn't quite finish it (I messed up somewhere, resulting in a memory leak for certain cases), but it was nearly working. I'm sure the tests in my patch will be useful at some point though, even if the implementation isn't, so I'll attach it to this bug shortly.
comment:10 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:11 by , 17 years ago
Cc: | removed |
---|---|
Owner: | changed from | to
Status: | assigned → new |
We decided on IRC it was best to do this by pushing the scaling factor(s) down to the Xapian::Weight objects. With a tiny tweak they can usually fold it into their own scaling factor(s), and for Xapian::Weight sub-classes which haven't been tweaked (i.e. user-implemented ones) we can add a "shim" class for compatibility.
Taking this bug, as I'm working on this. We should keep the new testcases from the patch I think.
comment:12 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Hmm, except that the test-cases check that Query::get_description() contains the simplifications, but we aren't simplifying until we build the PostList tree. I don't see how we can rework them though - if you see a way feel free to salvage them.
The "query optimiser" patch I've just committed pushes any the scaling factors to the leaves of the PostList tree, so marking this bug as fixed.
comment:14 by , 17 years ago
Blocking: | 200 removed |
---|---|
Operating System: | → All |
I've applied this to HEAD, and added smoketests to the bindings. Still needs:
the right expected descriptions. Needs to include multipliers of weight 1.5, 1, 0, -1)
multiplied by 0 (so, effectively boolean)