Opened 15 years ago
Closed 15 years ago
#475 closed defect (fixed)
Multi XOR can return inconsistent results
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.0.21 |
Component: | Matcher | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
I've run the following query, on a database containing 1000 random documents, in which each document contains a sequence of terms starting at N1, and ending at Nx, where x is a random integer in the range 1 to 100 (inclusive). ie, in this database, if a document contains term Nx, it also contains N(x-1), for all x > 1. All wdfs are 1.
Xapian::Query((N48:(wqf=2) XOR (N70 AND_NOT N42) XOR (N99 OR N68 OR N10 OR N28) XOR N22 XOR N44))
When I ask for only the top result (get_mset(0, 1)), a document which contains all terms to N99 is returned.
When I ask for all the results (get_mset(0, 1000)), this document is not returned.
I believe this is because, when asking for only the top result, the postlist decays changing some of the XORs into AND_NOTs.
The decay of XOR to AND_NOT changes the semantics of a multi-XOR (which is to match if an odd number of the children match). Consider:
N99 XOR N44 XOR N22
N99 present implies N44 and N22 present, so a document containing N99 will be returned.
However, this can decay to
(N99 XOR N44) AND_NOT N22
N99 present implies all the other terms are present, so the XOR part of this never matches a document containing N99.
I think the upshot of this is that either:
- we should disable the decay of XOR to AND_NOT, or
- we should disallow more than 2 children being passed to an XOR query constructor (and disallow redistribution of terms of XOR queries next to each other in the tree), or
- we should implement a MultiXORPostList which handles the decay correctly (though I've not thought hard about what that means).
Change History (8)
comment:1 by , 15 years ago
Version: | → SVN trunk |
---|
comment:3 by , 15 years ago
From discussion on IRC, it seems the problem is probably that XOR should pass min_weight of 0 to child posting lists, since it wants to know if either child matches since that match invalidates a match in the other child. Decay to AND_NOT should be safe still (the actual issue with the example being with the left XOR, not the decaying right one).
Richard: Are you working on this? Do you have a reduced test case already, or just the description of one?
comment:4 by , 15 years ago
I committed a fix and a testcase in r14525 for the problem of XOR children decaying, causing the XOR to return documents which it shouldn't have returned.
However, I think there is still a problem with the weights returned from multi-XORs: if we have A XOR B XOR C, a document which contains A, B and C should be returned, and the weight should be that from A + B + C. However, it will actually be the weights of exactly one of A, B or C; worse, it's unpredictable which of those weights will be returned because it depends on what order the posting list tree was built in (which isn't under the users control). I think to fix this there's no other option than implementing a MultiXorPostList - I'm not sure it's worth the effort, though.
comment:5 by , 15 years ago
Status: | new → assigned |
---|
Do you have a testcase for the multi-way xor issue?
comment:6 by , 15 years ago
Status: | assigned → new |
---|
comment:7 by , 15 years ago
Milestone: | 1.2.1 → 1.0.21 |
---|---|
Status: | new → assigned |
MultiXorPostList implemetation added in r14608, which should fix this completely.
Marking for backport to 1.0.21.
comment:8 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Backported for 1.0.21 as r14609.
This occurs in SVN trunk, but I suspect it's been present in Xapian for many years.