#363 closed defect (fixed)
Avoid using the termlist when calculating percentages in get_mset()
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | high | Milestone: | 1.1.2 |
Component: | Matcher | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | #181 | Operating System: | All |
Description
Currently, get_mset() uses the termlist when assigning percentages: it iterates through the top document's termlist, looks up the weight of each term which also appears in the query, and sums these weights to get the value which corresponds to 100%.
This requires access to the termlist (which can be quite costly), and it also doesn't work correctly for synonyms, since synonym-expanded terms aren't in the termlist.
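In outline, that termlist-based scheme amounts to something like the following sketch (illustrative only, not the actual matcher code; query_terms and term_max_weight are hypothetical stand-ins for the matcher's internal state):
```cpp
#include <map>
#include <set>
#include <string>

#include <xapian.h>

// Illustrative sketch of the current termlist-based scheme (not the real
// matcher code): walk the top document's termlist and sum the maximum weight
// of every term which also occurs in the query.  That sum is then treated as
// the weight corresponding to 100%.
double
max_weight_from_termlist(const Xapian::Database & db,
                         Xapian::docid top_docid,
                         const std::set<std::string> & query_terms,
                         const std::map<std::string, double> & term_max_weight)
{
    double weight_for_100_percent = 0.0;
    for (Xapian::TermIterator t = db.termlist_begin(top_docid);
         t != db.termlist_end(top_docid); ++t) {
        if (query_terms.find(*t) == query_terms.end()) continue;
        std::map<std::string, double>::const_iterator w =
            term_max_weight.find(*t);
        if (w != term_max_weight.end()) weight_for_100_percent += w->second;
    }
    // Terms introduced by OP_SYNONYM expansion never appear in the termlist,
    // so they can't contribute here - which is exactly the problem.
    return weight_for_100_percent;
}
```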
This could be avoided by keeping track of which Xapian::Weight::Internal objects were used for the best msetitem, or just of the sum of the termweights for these.
This information would need to be serialised along with the MSet.
Note - trunk is currently failing when assertions are on in the synonym4 test in apitest due to this bug.
Attachments (1)
Change History (11)
comment:1 by , 16 years ago
Status: | new → assigned |
---|
comment:2 by , 16 years ago
One method for fixing this (a rough code sketch follows the list):
- Keep a vector of all the weight objects generated when building the query.
- Tell each leaf postlist (or weight object) the index of its weight object in this list.
- Introduce a vector<bool> which is used to keep track of which weight objects have been used.
- Clear this before calling get_weight() on the top postlist from get_mset().
- Whenever a leaf postlist's get_weight() method is called, set the appropriate bit.
- If the result is a new top-doc in get_mset(), copy its vector<bool> to a safe place.
- Use the saved vector to calculate the weight which should correspond to 100%.
Risk: clearing and setting the bits in the vector for every match might be overly CPU intensive.
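A minimal sketch of that bookkeeping (illustrative names only, not actual Xapian matcher code):
```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of the bookkeeping described in the list above (illustrative
// names, not actual Xapian matcher code).
struct WeightBookkeeping {
    std::vector<double> max_weights;   // max weight of each weight object
    std::vector<bool> used;            // weight objects used for this document
    std::vector<bool> best_used;       // snapshot for the current best document

    // Clear before calling get_weight() on the top postlist for a document.
    void start_document() {
        used.assign(max_weights.size(), false);
    }

    // Called from a leaf postlist's get_weight() with its weight's index.
    void mark_used(std::size_t weight_index) {
        used[weight_index] = true;
    }

    // Called when this document becomes the new best match.
    void remember_best() {
        best_used = used;
    }

    // Sum of the max weights of the weight objects the best document used:
    // this is the value to treat as 100%.
    double best_max_weight() const {
        double total = 0.0;
        for (std::size_t i = 0; i != best_used.size(); ++i)
            if (best_used[i]) total += max_weights[i];
        return total;
    }
};
```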
comment:3 by , 16 years ago
Feel free to try it, but I think that's going to be measurably slower.
On the plus side, note that this can potentially be optimised for AND queries whose children are all leaves - e.g. a MultiAndPostList of terms knows for certain that all its sub-postlists will match, so there's no need to recurse into them - we can store a precomputed mask to OR in.
I think we would probably want a specialised class rather than vector<bool>, as we don't need dynamic resizing (just a runtime-specified fixed size, which rules out bitset<>, whose size is a compile-time constant), but we do want the ability to do v1 |= v2 efficiently. However, vector<bool> should be fine for prototyping to see what sort of overhead there is here.
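For example, such a specialised class might look roughly like this (a sketch under the assumptions above, not an actual Xapian class): the size is fixed at construction, and operator|= works a machine word at a time, which also supports the precomputed-mask trick for all-leaf AND subtrees:
```cpp
#include <cstddef>
#include <vector>

// Sketch of a runtime-fixed-size bitmask (not an actual Xapian class): no
// resizing after construction, and operator|= ORs whole machine words, so an
// all-leaf AND subtree can keep one precomputed mask and OR it in directly.
class LeafMask {
    static const std::size_t BITS = 8 * sizeof(unsigned long);
    std::vector<unsigned long> words;

  public:
    explicit LeafMask(std::size_t nbits)
        : words((nbits + BITS - 1) / BITS, 0ul) { }

    void set(std::size_t bit) {
        words[bit / BITS] |= 1ul << (bit % BITS);
    }

    bool test(std::size_t bit) const {
        return (words[bit / BITS] >> (bit % BITS)) & 1ul;
    }

    void clear() {
        words.assign(words.size(), 0ul);
    }

    LeafMask & operator|=(const LeafMask & other) {
        for (std::size_t i = 0; i != words.size(); ++i)
            words[i] |= other.words[i];
        return *this;
    }
};
```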
Perhaps better, be lazy about this and only compute it (via a new method) for a match which becomes the current best. This has a bad worst case though - if the weights increase with docid, it's even more work than the non-lazy approach, as there are a lot of extra virtual method calls.
This could be made optional on a "want percentages" flag, but it would be nice to use this for "matching terms" too, which would require calculating and keeping this information for anything in the proto-MSet...
I don't think we want to rush into this, but we clearly can't ship 1.1.1 with the testsuite failing assertions. So I think it's better to find a quick fix for the synonym4 assertion failure, perhaps along the lines of the (non-working) patch I tried.
comment:4 by , 16 years ago
I fixed the assertion failure in r12623, essentially by forcing the percentage of the top document to be 100% if no non-synonym terms match it. However, this isn't a very desirable result, so an alternative way to fix it would be good.
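As a hedged sketch of what that stop-gap amounts to (illustrative only, not the actual r12623 code; the names are hypothetical):
```cpp
#include <algorithm>

// Illustrative sketch only (not the actual r12623 change): if no non-synonym
// terms match the top document, the termlist-derived denominator is zero, so
// the stop-gap is simply to report the top document as 100%.
int top_document_percent(double top_weight, double weight_for_100_percent)
{
    if (weight_for_100_percent <= 0.0) return 100;  // stop-gap forcing
    double pc = 100.0 * top_weight / weight_for_100_percent;
    return static_cast<int>(std::min(std::max(pc, 0.0), 100.0));
}
```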
comment:5 by , 16 years ago
Milestone: | 1.1.1 → 1.1.3 |
---|
comment:6 by , 15 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
I've had an idea...
We could just count the number of matching leaves and divide by the (original) total to get the factor to scale down the top match by. This isn't quite as good as the current approach, which scales down by more if a higher-scoring term is missing, but it's simpler, and percentages are rather arbitrary anyway. We also suspect they are less popular than they used to be.
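A minimal sketch of that idea, assuming the scale factor is simply the fraction of query leaves which match the top document (hypothetical names, not the attached patch):
```cpp
// Illustrative sketch (not the attached patch): scale the top document down
// by the fraction of query leaves which actually match it, e.g. 2 of 3
// matching leaves turns a would-be 100% into 66%.
int scaled_top_percent(unsigned matching_leaves, unsigned total_leaves)
{
    if (total_leaves == 0) return 0;
    return static_cast<int>((100.0 * matching_leaves) / total_leaves);
}
```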
Having a look at this idea, so reassigning the bug, at least for now.
by , 15 years ago
Attachment: | xapian-percent-scaling-without-termlists.patch added |
---|
Patch which implements my idea
comment:7 by , 15 years ago
Milestone: | 1.1.3 → 1.1.2 |
---|---|
Status: | new → assigned |
OK, I have a patch, and it passes the testsuite with changes to two tests:
- topercent2 needs a tweak because we generate different percentages
- synonym4 needs a tweak to disable a check that we attain 100% for OP_XOR of a term and a synonym of two terms.
The former seems fine to me.
As for the latter, unless I'm missing something, it's actually incorrect that we give 100% currently - or at least inconsistent: perhaps a XOR b should give 100% if the XOR matches, but it doesn't currently.
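For reference, the query shape being discussed looks roughly like this (an illustrative reconstruction with made-up terms, not the actual synonym4 test code):
```cpp
#include <xapian.h>

// Illustrative reconstruction (not the actual synonym4 test): a single term
// XORed with a synonym group of two terms.  The weight contributed by the
// synonym group cannot be recovered by walking the document's termlist,
// which is what breaks the termlist-based percentages.
Xapian::Query
make_term_xor_synonym_query()
{
    Xapian::Query synonym(Xapian::Query::OP_SYNONYM,
                          Xapian::Query("foo"),
                          Xapian::Query("bar"));
    return Xapian::Query(Xapian::Query::OP_XOR,
                         Xapian::Query("baz"), synonym);
}
```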
Although the patch currently passes all tests, I think the remote case may need further work. I don't think the testsuite covers multi over a mixture of local and remote databases (we could add support for this as another multi backend variant, though the combinatorial explosion means we need to be careful about exactly how we do this...). Comments are still welcome.
Setting milestone to 1.1.2 since we have a patch of sorts - if we omit this from 1.1.2 it should be a conscious choice.
comment:8 by , 15 years ago
Priority: | normal → high |
---|
comment:9 by , 15 years ago
Patch committed with one change - I added a check that XOR with a SYNONYM subquery does not achieve 100%.
Leaving this open for now as there may be a few remaining issues to deal with.
comment:10 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
OK, I think I've now dealt with the remaining issues.