Opened 13 years ago
Closed 10 years ago
#557 closed enhancement (wontfix)
Allow subqueries to use separate weighting schemes
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Library API | Version: | |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
When using multiple sources of weighting information, it would be very handy to be able to use separate weighting schemes for some subqueries. This could be implemented by adding a Query operator which takes a subquery and a Weight object, and causes the Weight object to be used for all posting lists generated by the subquery.
Example of a situation this could be useful in: Imagine a database of documents tagged with people IDs. Suppose that people who are tagged in more events are considered more important, since they may represent "authorities" in the social network. When searching for documents that match a query and also match a set of IDs, the query part may want to use standard BM25 weighting, but the ID part may want to use a weighting scheme which applies a higher weight to IDs with a higher term frequency, rather than a lower weight.
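To make the scenario concrete, here is a minimal toy sketch in plain Python. It does not use the Xapian API: the collection statistics, the term names (`festival`, `XPalice`, `XPbob`) and both weight functions are hypothetical, and the text weight is only a simplified BM25-style IDF factor, not the full BM25 formula. It just shows the two parts of the query pulling in opposite directions with respect to term frequency.

```python
import math

# Hypothetical collection statistics, invented for illustration only.
N_DOCS = 1000          # total number of documents
TERMFREQ = {           # number of documents each term occurs in
    "festival": 40,    # ordinary text term
    "XPalice": 300,    # person ID tagged in many documents
    "XPbob": 5,        # person ID tagged in few documents
}


def idf_weight(term):
    """Simplified BM25-style IDF factor: rarer terms get a higher weight."""
    n = TERMFREQ[term]
    return math.log((N_DOCS - n + 0.5) / (n + 0.5) + 1.0)


def authority_weight(term):
    """Hypothetical ID weighting: more frequently tagged IDs get a higher weight."""
    return math.log(1.0 + TERMFREQ[term])


def score(doc_terms, text_terms, id_terms):
    """Sum the text weights and ID weights of the query terms present in a document."""
    text = sum(idf_weight(t) for t in text_terms if t in doc_terms)
    ids = sum(authority_weight(t) for t in id_terms if t in doc_terms)
    return text + ids


# Two documents matching the same text term but tagged with different people.
doc_alice = {"festival", "XPalice"}
doc_bob = {"festival", "XPbob"}
query_text = ["festival"]
query_ids = ["XPalice", "XPbob"]

alice_score = score(doc_alice, query_text, query_ids)
bob_score = score(doc_bob, query_text, query_ids)
```

Under the IDF-style weight alone, the rarely tagged ID would score higher. With the separate ID weighting, the document tagged with the more frequently tagged person ranks first, which is the behaviour the ticket asks for.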
Things to think about:
- What to do about the term-independent part of the weight (probably we'd just use the term-independent part of the top-level weight).
- How does this interact with OP_SYNONYM?
- Should the query length each weight object sees be the global query length, or just the length of the part of the query with the adjusted weight? If the latter, should the query length of the parts of the query without the adjusted weight be reduced accordingly?
We've just been discussing this on IRC.
Technically this should be possible to implement, but I'm dubious that adding together weights from different weighting formulae is useful in general - weighting schemes produce weights for terms which ought to produce a useful outcome when added together, but only within a weighting scheme - e.g. ranking computers by `cpu_speed + disk_size` is unlikely to make sense even if you think `cpu_speed` and `disk_size` are useful rankings by themselves (but to continue the analogy you can usefully rank by the sum of `disk_size` across multiple disks).
In the social network example, the second weighting scheme would need to be one carefully tailored to fit with the main weighting scheme, so being able to reuse existing Weight subclasses isn't a consideration here, and I think that will be true in general. I think this example could be addressed well with a PostingSource subclass, and I suspect that's true more generally.
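The scale problem can be illustrated with a small toy sketch in plain Python. This is not Xapian code: the documents, the text weights, the `tag_count` field and the `MAX_BOOST` cap are all invented for illustration. It compares naively adding a weight from an unrelated formula with adding a weight tailored to the range of the main scheme, which is roughly the kind of per-document weight a custom PostingSource subclass could supply.

```python
import math

# Hypothetical per-document data: a text weight from the main scheme and
# the number of times the matching person ID is tagged across the database.
DOCS = {
    "a": {"text": 9.0, "tag_count": 2},
    "b": {"text": 2.0, "tag_count": 300},
    "c": {"text": 6.0, "tag_count": 40},
}

MAX_TAGS = 300    # assumed upper bound on tag_count
MAX_BOOST = 3.0   # assumed cap, chosen to be comparable to the text weights


def raw_count(doc):
    """Untailored second weight: the raw tag count, on an unrelated scale."""
    return float(doc["tag_count"])


def tailored(doc):
    """Tailored second weight: log-damped and scaled into [0, MAX_BOOST]."""
    return MAX_BOOST * math.log1p(doc["tag_count"]) / math.log1p(MAX_TAGS)


def rank(docs, extra_weight):
    """Order document names by text weight plus the extra weight, best first."""
    return sorted(
        docs,
        key=lambda name: docs[name]["text"] + extra_weight(docs[name]),
        reverse=True,
    )


naive_order = rank(DOCS, raw_count)     # the tag count swamps the text weight
tailored_order = rank(DOCS, tailored)   # the boost only nudges the text ranking
```

With the raw count, the ranking is decided almost entirely by `tag_count` and the text match barely matters. With the tailored weight, the text ranking is preserved and the boost only shifts documents by a bounded amount. The point is that the second formula has to be designed around the first, so an off-the-shelf weighting scheme is unlikely to slot in.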
So I think we should close this as wontfix, unless we can come up with a scenario where it is actually a better approach than PostingSource.