Opened 17 years ago

Closed 17 years ago

Last modified 17 years ago

#203 closed defect (released)

Add query operator to multiply weights by a factor

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone:
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Currently, the only way to manually modify the weights of particular terms is to adjust the wdf at the time they are indexed. This is useful, but limited. I would like to be able to modify the weights of particular terms at query construction time. I believe the easiest way to do this is to add a query operator which takes a parameter (as a double) and a single sub-query, and multiplies all the weights of the subquery by the parameter. This would allow arbitrary weight modifications at search time, by applying the operator to parts of the query. This could also be very useful for integrating "expanded" query terms with a lower importance than "user-entered" query terms, indicating differing levels of confidence in the importance of terms.

This may be a little controversial as a new feature, because I know of no theoretical model to explain why multiplying the weights of part of a query by a factor should be a useful thing to do. However, I've come across several places where some kind of search-time weight adjustment would be extremely useful, this is the simplest way I can think of to adjust weights, and my initial experiments with this have produced plausible-looking results.

I have produced an initial implementation of this feature, which I will attach to this bug. Ideally, I'd like to apply it to head shortly after we've released 1.0.3, so I'll mark the bug as blocking 1.0.4, for now.

Attachments (2)

multweight.patch (22.9 KB ) - added by Richard Boulton 17 years ago.
Patch implementing new OP_MULT_WEIGHT query operator
patch (7.1 KB ) - added by Richard Boulton 17 years ago.
Nearly-working implementation of SCALE_WEIGHT query optimisation

Download all attachments as: .zip

Change History (15)

comment:1 by Richard Boulton, 17 years ago

Blocking: 200 added
Status: newassigned

comment:2 by Richard Boulton, 17 years ago

I've applied this to HEAD, and added smoketests to the bindings. Still needs:

  1. More C++ tests (for various structured queries, to check that they produce

the right expected descriptions. Needs to include multipliers of weight 1.5, 1, 0, -1)

  1. Support for decomposing repeated mult_weight operators: eg,

Foo * 5 * 1/5 == Foo Foo * 5 * 10 == Foo * 50 (Foo * 5 OR Bar * 5) * 10 == (Foo OR Bar) * 50 (Foo * 5) * 0 == Foo * 0

  1. Tests for result of running a search with some parts of the query tree

multiplied by 0 (so, effectively boolean)

comment:3 by Olly Betts, 17 years ago

Blocked By: 164 added

I had some comments on this, but didn't have time to write them before going away.

I think the name could be improved - it would be better to use "MULTIPLY" than a somewhat awkward abbreviation like "MULT", but I don't think multiplication is really the essence of this operation anyway, and something like "OP_SCALE_WEIGHT" seems better still.

The test against DBL_EPSILON to decide if we're boolean seems incorrect - I think just (dbl_parameter <= 0.0) is what is needed there. Actually, perhaps a negative scaling factor should result in InvalidArgumentError rather than being silently clipped to 0?

I think we should probably add this before OP_ELITE_SET so that we no longer have a "gappy" enum (unless there's an issue with doing so which I'm missing). The gap is due to OP_WEIGHT_CUTOFF (or whatever it was called) being removed, and we've definitely broken ABI compatibility since then.

Also, AssertNe(multiplier, 0) should probably just be written as Assert(multiplier != 0) - the assertion message only happens when x == 0, so using AssertNe() just means more code is generated for producing the assertion failure message without actually improving it at all.

comment:4 by Richard Boulton, 17 years ago

I like the name OP_SCALE_WEIGHT much better - I've put together a patch to change it to that. For consistency, I've also renamed MultWeightPostList (and the corresponding source files).

I've also moved the operator to be before OP_ELITE_SET - I hadn't remembered why there was a gap there, so I just added to the end of the enum. It is indeed better to fill the gap.

Your comment about assertions makes sense, I'll bear that in mind in future (and I've changed it).

You're probably right about the DBL_EPSILON test, too: my thinking behind testing against DBL_EPSILON to decide if we're boolean was that handling a query as boolean, instead of multiplying its weight by a very small value, is just an optimisation. However, the ordering of documents whose weights are multiplied by DBL_EPSILON is is likely to be different from boolean sort order, so this was probably a mistake.

If we're not using the DBL_EPSILON test, it might well make sense to raise an exception for a negative value. My motivation for clipping negative values to zero was that, if we're doing a DBL_EPSILON test, it would be wrong to raise an exception for a negative value which is greater than -DBL_EPSILON. It seemed wrong to have an exception raised for larger negative values, if small negative values were permitted. This reasoning may not be that sound anyway, but if we're not going to do an epsilon test anyway, it's all moot.

So, I'm in favour of removing the DBL_EPSILON test against 0, and raising an exception if dbl_parameter < 0.

I still think we should keep the epsilon test in api/omqueryinternal.cc in simplify_query() comparing dbl_parameter to 1.0 - this _is_ just an optimisation, and if the value is very slightly different from 1.0 the effect on rankings will be negligible.

comment:5 by Richard Boulton, 17 years ago

Not sure why this bug depends on #164 - Olly: can you comment? I'm wondering if making the dependency was a slip...

comment:6 by Olly Betts, 17 years ago

Blocked By: 164 removed

Did I add the dependency on 164? Bugzilla doesn't seem to say, but if so it must have been finger trouble. Removed it anyway, as it's clearly bogus.

The DBL_EPSILON test in simplify_query() seems fine to me - that's the sort of use which the epsilon constants are intended for.

It would be good to ensure there's a decomposing test for a fraction which isn't exactly representable in base 2, and which when multiplied by its reciprocal doesn't give 1.0 exactly in IEEE double representation.

comment:7 by Richard Boulton, 17 years ago

I was about to ask why this bug depended on #164 (it was set to do so by Olly a few hours ago)...

I've written and committed some tests for searches with parts of the query tree multiplied by 0 (and by negative numbers), and have written (but not yet tidied and committed) a few more tests of queries with different structures.

Therefore, the only significant outstanding feature needed before this bug can be closed is support for decomposing nested OP_SCALE_WEIGHT operators (this probably isn't desperately important, but I'd like to pull common multipliers out of OR or AND groups, so that the normal optimisations can work within those groups).

The network protocol should also be bumped (but just a minor bump) because an old server won't understand a query with the OP_SCALE_WEIGHT operator in it.

comment:8 by Olly Betts, 17 years ago

I think it's better to leave further optimisations of this for now. I'm looking at doing some reworking for bug#164 which may make this easier to do...

comment:9 by Richard Boulton, 17 years ago

Oh - I spent a little time yesterday while my network connection was down implementing some optimisations for pulling scale-weight nodes up. I didn't quite finish it (I messed up somewhere, resulting in a memory leak for certain cases), but it was nearly working. I'm sure the tests in my patch will be useful at some point though, even if the implementation isn't, so I'll attach it to this bug shortly.

by Richard Boulton, 17 years ago

Attachment: patch added

Patch implementing new OP_MULT_WEIGHT query operator

comment:10 by Richard Boulton, 17 years ago

attachments.isobsolete: 01

comment:11 by Olly Betts, 17 years ago

Cc: olly@… removed
Owner: changed from Richard Boulton to Olly Betts
Status: assignednew

We decided on IRC it was best to do this by pushing the scaling factor(s) down to the Xapian::Weight objects. With a tiny tweak they can usually fold it into their own scaling factor(s), and for Xapian::Weight sub-classes which haven't been tweaked (i.e. user-implemented ones) we can add a "shim" class for compatibility.

Taking this bug, as I'm working on this. We should keep the new testcases from the patch I think.

comment:12 by Olly Betts, 17 years ago

Resolution: fixed
Status: newclosed

Hmm, except that the test-cases check that Query::get_description() contains the simplifications, but we aren't simplifying until we build the PostList tree. I don't see how we can rework them though - if you see a way feel free to salvage them.

The "query optimiser" patch I've just committed pushes any the scaling factors to the leaves of the PostList tree, so marking this bug as fixed.

comment:13 by Olly Betts, 17 years ago

Resolution: fixedreleased

Fixed in 1.0.4

comment:14 by Olly Betts, 17 years ago

Blocking: 200 removed
Operating System: All
Note: See TracTickets for help on using tickets.