Opened 10 years ago
Closed 10 years ago
#653 closed defect (fixed)
OP_PHRASE seems to sometimes match terms in wrong order
Reported by: | Jean-Francois Dockes | Owned by: | Olly Betts |
---|---|---|---|
Priority: | highest | Milestone: | 1.2.21 |
Component: | Matcher | Version: | 1.2.16 |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | Linux |
Description
It seems that OP_PHRASE will sometimes match terms in the wrong order. This was reported by a Recoll user on Xapian 1.2.12, and I seem to be able to reproduce it on Xapian 1.2.16:
jfmac$ delve -t hurricane xapiandb/ Posting List for term `hurricane' (termfreq 1, collfreq 2, wdf_max 2): 4 jfmac$ delve -t katrina xapiandb/ Posting List for term `katrina' (termfreq 1, collfreq 3, wdf_max 3): 4 jfmac$ delve -t hurricane -r 4 xapiandb/ Position List for term `hurricane', record #4: 199881 203084 jfmac$ delve -t katrina -r 4 xapiandb/ Position List for term `katrina', record #4: 199882 202473 203085 jfmac$ xadump -q katrina hurricane q argc 2 DB: ndocs 8 lastdocid 8 avglength 135233 DB: terms are stripped Performing query `Xapian::Query((katrina PHRASE 5 hurricane))' Estimated results: 1 Document ID 4 100% [url=file:///home/dockes/Downloads/[Mehdi_Khosrow-Pour,_Mehdi_Khosrow-Pour]_Encyclope(BookZZ.org).txt ...
(xadump is a tool in the recoll source which performs low level Xapian operations much like delve. Unlike delve it has a primitive query capability, normally performing AND queries, here modified to do OP_PHRASE, which a compiled-in phrase window (here 5). The window has to be at least 4 for the problem to appear (no match with a window of 2 or 3).
The index is not that big (3.6 MB compressed), I'll try to attach it to the ticket. If this fails, I can upload it wherever convenient.
Change History (11)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Component: | Backend-Chert → Matcher |
---|---|
Milestone: | → 1.3.3 |
Status: | new → assigned |
Version: | → 1.2.16 |
No need for the test database - I can trivially create one to reproduce the issue:
Xapian::WritableDatabase wdb(get_writable_database()); Xapian::Document doc; doc.add_posting("hurricane", 199881); doc.add_posting("hurricane", 203084); doc.add_posting("katrina", 199882); doc.add_posting("katrina", 202473); doc.add_posting("katrina", 203085); wdb.add_document(doc); wdb.commit(); const char * qterms[] = { "katrina", "hurricane" }; Xapian::Enquire e(wdb); Xapian::Query q(q.OP_PHRASE, qterms, qterms + 2, 5); e.set_query(q); Xapian::MSet mset = e.get_mset(0, 100); TEST_EQUAL(mset.size(), 0);
It's also present in trunk - I'll investigate.
BTW, if you want a tool which can run queries for investigating these sorts of issues, take a look at "quest" (in xapian-core/examples/):
$ quest -d .brass/dbw 'katrina ADJ/4 hurricane' Parsed Query: Query((katrina@1 PHRASE 5 hurricane@2)) MSet: 1: [0.43676]
It doesn't support controlling every possible option, but you can certainly do more than just primitive querying with it, and it's gradually growing support for more things.
comment:3 by , 10 years ago
Milestone: | 1.3.3 → 1.3.2 |
---|
comment:4 by , 10 years ago
comment:5 by , 10 years ago
BTW, this only affects "loose" phrases, i.e. where window size > number of terms. For exact phrases, there's a special case handler since the checks required are simpler.
comment:6 by , 10 years ago
Milestone: | 1.3.2 → 1.3.3 |
---|
I'm going to move this back to 1.3.3 again. It would be good to confirm it's fixed (or fixed it more fully if the fix in comment:4 wasn't complete), but I don't want to delay 1.3.2 even longer.
comment:7 by , 10 years ago
Priority: | normal → highest |
---|
I've written a little C++ program to generate random OP_PHRASE
queries where window size > number of terms, and run them against the apitest_simpledata database from the xapian-core testsuite, and compare the results against a naively coded phrase check. It can crunch through over a million in under a minute, finding 1 or 2 per 100,000 that aren't matched correctly. Some cases OP_PHRASE
reports a match which the naive check doesn't see, while others are where OP_PHRASE
misses a match. I haven't hand inspected all of them, but at least some seem to be genuine (and not just bugs in my test program!)
So unfortunately this isn't fixed fully, but on the plus side, I can generate lots of test cases for it on a tiny data set.
comment:8 by , 10 years ago
Also of note is that I can't find any erroneous cases with just two terms, so my original fix looks good for that.
comment:9 by , 10 years ago
Milestone: | 1.3.3 → 1.2.21 |
---|
Fixed in [a30f7ab6621f7533349678930ad421c9f7fb9baf].
Wants backporting to 1.2.x.
comment:10 by , 10 years ago
OP_NEAR had issues too - fixed those in [ffbb7f0efed958ca1c43a88805f5ec3e376289e8].
comment:11 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I've backported both these fixes plus some follow-on improvements for 1.2.21.
The file is too big for upload. Let me know where I can send it (if needed).