Opened 13 years ago

Closed 9 years ago

#557 closed enhancement (wontfix)

Allow subqueries to use separate weighting schemes

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Library API
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

When using multiple sources of weighting information, it would be very handy to be able to use separate weighting schemes for some subqueries. This could be implemented by adding a Query operator which takes a subquery and a Weight object, and causes the Weight object to be used for all posting lists generated by the subquery.

An example of where this could be useful: imagine a database of documents tagged with people IDs. Suppose that people who are tagged in more events are considered more important, since they may represent "authorities" in the social network. When searching for documents that match both a text query and a set of IDs, the text part may want standard BM25 weighting, while the ID part may want a weighting scheme which gives a higher weight, rather than a lower one, to IDs with a higher term frequency.
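To make the contrast concrete, here is a minimal sketch in plain Python (not Xapian code). `idf_weight` uses a common BM25-style inverse document frequency variant, which is not necessarily the exact formula Xapian uses, and `authority_weight` is a hypothetical formula invented purely for illustration, since the ticket does not specify one.

```python
import math

def idf_weight(termfreq, num_docs):
    """BM25-style IDF component: rarer terms get a higher weight."""
    return math.log((num_docs - termfreq + 0.5) / (termfreq + 0.5) + 1.0)

def authority_weight(termfreq, num_docs):
    """Hypothetical 'authority' weight: IDs tagged in more documents score higher."""
    return math.log(1.0 + termfreq) / math.log(1.0 + num_docs)

num_docs = 10_000
rare, common = 5, 2_000  # number of documents each person ID is tagged in

# The two schemes order the same pair of IDs in opposite directions.
idf_prefers_rare = idf_weight(rare, num_docs) > idf_weight(common, num_docs)
authority_prefers_common = (
    authority_weight(common, num_docs) > authority_weight(rare, num_docs)
)
print(idf_prefers_rare, authority_prefers_common)
```

Both comparisons come out true: the IDF-style weight favours the rarely used ID, while the authority-style weight favours the frequently used one, which is the behaviour the ID part of the query would want.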

Things to think about:

  • What to do about the term-independent part of the weight (probably we'd just use the term-independent part of the top-level weight).
  • How does this interact with OP_SYNONYM?
  • Should the query length each weight object sees be the global query length, or just the length of the part of the query with the adjusted weight? If the latter, should the query length of the parts of the query without the adjusted weight be reduced accordingly?

Change History (1)

comment:1 by Olly Betts, 9 years ago

Resolution: wontfix
Status: new → closed

We've just been discussing this on IRC.

Technically this should be possible to implement, but I'm dubious that adding together weights from different weighting formulae is useful in general. Weighting schemes produce weights for terms which ought to give a useful outcome when added together, but only within a single weighting scheme. For example, ranking computers by cpu_speed + disk_size is unlikely to make sense even if you think cpu_speed and disk_size are useful rankings by themselves (though, to continue the analogy, you can usefully rank by the sum of disk_size across multiple disks).
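The cpu_speed + disk_size analogy can be illustrated with a toy calculation in plain Python. All the figures and names below are invented for illustration. The sketch shows that a sum across two unrelated scales gives a ranking that depends on an arbitrary choice of units, whereas a sum within one scale (total disk size) does not have that problem.

```python
# Toy data, invented for illustration: (cpu_speed in GHz, disk sizes in GB).
computers = {
    "a": (4.5, [250]),
    "b": (2.0, [500, 750]),
    "c": (3.8, [1000]),
}

def total_disk(name):
    # Summing within one "scheme" is meaningful: total storage across disks.
    return sum(computers[name][1])

def mixed_score(name, disk_unit_gb=1):
    # Summing across "schemes" mixes units, so the result depends on the unit.
    cpu_speed, disks = computers[name]
    return cpu_speed + sum(disks) / disk_unit_gb

by_disk = sorted(computers, key=total_disk, reverse=True)
by_mixed_gb = sorted(computers, key=mixed_score, reverse=True)
by_mixed_tb = sorted(computers, key=lambda n: mixed_score(n, 1000), reverse=True)

print(by_disk, by_mixed_gb, by_mixed_tb)
```

With disk size in GB the mixed ranking is identical to ranking by disk size alone, because the larger scale swamps cpu_speed. Expressing the same disks in TB produces a different order, even though nothing about the computers has changed.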

In the social network example, the second weighting scheme would need to be one carefully tailored to fit with the main weighting scheme, so being able to reuse existing Weight subclasses isn't a consideration here, and I think that will be true in general. I think this example could be addressed well with a PostingSource subclass, and I suspect that's true more generally.
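As a rough illustration of the PostingSource idea, here is a toy model in plain Python. It is not Xapian code: `AuthoritySource` is a hypothetical class that only loosely mirrors the iterate-and-weight shape of a posting source (`next`, `at_end`, `get_docid`, `get_weight`) and omits everything else the real class requires. The data, the `log(1 + count)` formula, and the `scale` parameter are all assumptions made for this sketch. `scale` stands in for the tailoring needed to make the extra weight fit alongside the main weighting scheme.

```python
import math

class AuthoritySource:
    """Toy stand-in for a posting source: walks matching docids in order."""

    def __init__(self, doc_tags, tag_counts, wanted_ids, scale=1.0):
        self.tag_counts = tag_counts
        self.scale = scale
        wanted = set(wanted_ids)
        # Keep only documents tagged with at least one wanted ID.
        self.postings = []
        for docid in sorted(doc_tags):
            matched = wanted.intersection(doc_tags[docid])
            if matched:
                self.postings.append((docid, matched))
        self.pos = -1  # positioned before the first entry

    def next(self):
        self.pos += 1

    def at_end(self):
        return self.pos >= len(self.postings)

    def get_docid(self):
        return self.postings[self.pos][0]

    def get_weight(self):
        # Higher weight for IDs that are tagged in more documents.
        matched = self.postings[self.pos][1]
        return self.scale * sum(math.log(1.0 + self.tag_counts[t]) for t in matched)

# Invented example data.
doc_tags = {1: ["p1"], 2: ["p2"], 3: ["p1", "p2"], 4: ["p3"]}
tag_counts = {"p1": 200, "p2": 3, "p3": 50}
text_weight = {1: 2.0, 2: 2.0, 3: 1.0}  # stand-in for BM25 weights of the text part

source = AuthoritySource(doc_tags, tag_counts, ["p1", "p2"], scale=0.5)
combined = {}
source.next()
while not source.at_end():
    docid = source.get_docid()
    if docid in text_weight:  # document must match both parts of the query
        combined[docid] = text_weight[docid] + source.get_weight()
    source.next()

ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Documents 1 and 2 have the same text weight, so the document tagged with the more widely used ID ranks higher. Changing `scale` changes how much the ID part can override the text part, which is the kind of fitting to the main scheme described above.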

So I think we should close this as wontfix, unless we can come up with a scenario where it is actually a better approach than PostingSource.
