Opened 10 years ago

Closed 9 years ago

#653 closed defect (fixed)

OP_PHRASE seems to sometimes match terms in wrong order

Reported by: Jean-Francois Dockes Owned by: Olly Betts
Priority: highest Milestone: 1.2.21
Component: Matcher Version: 1.2.16
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: Linux

Description

It seems that OP_PHRASE will sometimes match terms in the wrong order. This was reported by a Recoll user on Xapian 1.2.12, and I seem to be able to reproduce it on Xapian 1.2.16:

jfmac$ delve -t hurricane  xapiandb/
Posting List for term `hurricane' (termfreq 1, collfreq 2, wdf_max 2): 4

jfmac$ delve -t katrina  xapiandb/
Posting List for term `katrina' (termfreq 1, collfreq 3, wdf_max 3): 4

jfmac$ delve -t hurricane  -r 4 xapiandb/
Position List for term `hurricane', record #4: 199881 203084

jfmac$ delve -t katrina  -r 4 xapiandb/
Position List for term `katrina', record #4: 199882 202473 203085

jfmac$ xadump -q katrina hurricane
q argc 2
DB: ndocs 8 lastdocid 8 avglength 135233
DB: terms are stripped
Performing query `Xapian::Query((katrina PHRASE 5 hurricane))'
Estimated results: 1
Document ID 4	100% [url=file:///home/dockes/Downloads/[Mehdi_Khosrow-Pour,_Mehdi_Khosrow-Pour]_Encyclope(BookZZ.org).txt
...

(xadump is a tool in the recoll source which performs low level Xapian operations much like delve. Unlike delve it has a primitive query capability, normally performing AND queries, here modified to do OP_PHRASE, which a compiled-in phrase window (here 5). The window has to be at least 4 for the problem to appear (no match with a window of 2 or 3).

The index is not that big (3.6 MB compressed), I'll try to attach it to the ticket. If this fails, I can upload it wherever convenient.

Change History (11)

comment:1 by Jean-Francois Dockes, 10 years ago

The file is too big for upload. Let me know where I can send it (if needed).

comment:2 by Olly Betts, 10 years ago

Component: Backend-ChertMatcher
Milestone: 1.3.3
Status: newassigned
Version: 1.2.16

No need for the test database - I can trivially create one to reproduce the issue:

    Xapian::WritableDatabase wdb(get_writable_database());
    Xapian::Document doc;
    doc.add_posting("hurricane", 199881);
    doc.add_posting("hurricane", 203084);
    doc.add_posting("katrina", 199882);
    doc.add_posting("katrina", 202473);
    doc.add_posting("katrina", 203085);
    wdb.add_document(doc);
    wdb.commit();
    const char * qterms[] = { "katrina", "hurricane" };
    Xapian::Enquire e(wdb);
    Xapian::Query q(q.OP_PHRASE, qterms, qterms + 2, 5);
    e.set_query(q);
    Xapian::MSet mset = e.get_mset(0, 100);
    TEST_EQUAL(mset.size(), 0);

It's also present in trunk - I'll investigate.

BTW, if you want a tool which can run queries for investigating these sorts of issues, take a look at "quest" (in xapian-core/examples/):

$ quest -d .brass/dbw 'katrina ADJ/4 hurricane'
Parsed Query: Query((katrina@1 PHRASE 5 hurricane@2))
MSet:
1: [0.43676]

It doesn't support controlling every possible option, but you can certainly do more than just primitive querying with it, and it's gradually growing support for more things.

comment:3 by Olly Betts, 9 years ago

Milestone: 1.3.31.3.2

comment:4 by Olly Betts, 9 years ago

The changes in r18282 at least fix the test case in comment:2, though I suspect there may still be issues with non-tight phrases with more than two terms. I'll review the code in more detail and set up an automated soak test of phrase matching.

comment:5 by Olly Betts, 9 years ago

BTW, this only affects "loose" phrases, i.e. where window size > number of terms. For exact phrases, there's a special case handler since the checks required are simpler.

comment:6 by Olly Betts, 9 years ago

Milestone: 1.3.21.3.3

I'm going to move this back to 1.3.3 again. It would be good to confirm it's fixed (or fixed it more fully if the fix in comment:4 wasn't complete), but I don't want to delay 1.3.2 even longer.

comment:7 by Olly Betts, 9 years ago

Priority: normalhighest

I've written a little C++ program to generate random OP_PHRASE queries where window size > number of terms, and run them against the apitest_simpledata database from the xapian-core testsuite, and compare the results against a naively coded phrase check. It can crunch through over a million in under a minute, finding 1 or 2 per 100,000 that aren't matched correctly. Some cases OP_PHRASE reports a match which the naive check doesn't see, while others are where OP_PHRASE misses a match. I haven't hand inspected all of them, but at least some seem to be genuine (and not just bugs in my test program!)

So unfortunately this isn't fixed fully, but on the plus side, I can generate lots of test cases for it on a tiny data set.

comment:8 by Olly Betts, 9 years ago

Also of note is that I can't find any erroneous cases with just two terms, so my original fix looks good for that.

comment:9 by Olly Betts, 9 years ago

Milestone: 1.3.31.2.21

Fixed in [a30f7ab6621f7533349678930ad421c9f7fb9baf].

Wants backporting to 1.2.x.

comment:10 by Olly Betts, 9 years ago

OP_NEAR had issues too - fixed those in [ffbb7f0efed958ca1c43a88805f5ec3e376289e8].

comment:11 by Olly Betts, 9 years ago

Resolution: fixed
Status: assignedclosed

I've backported both these fixes plus some follow-on improvements for 1.2.21.

Note: See TracTickets for help on using tickets.