Opened 11 years ago

Closed 11 years ago

#636 closed defect (fixed)

get_docid() and multiple databases

Reported by: Jeff Rand
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.18
Component: Library API
Version: 1.2.12
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: Linux

Description (last modified by Olly Betts)

I'm using the python bindings for xapian 1.2.12 and I'm getting some unexpected behavior which I believe is a bug. While searching multiple databases I am getting inconsistent values from doc.get_docid() when using an overloaded KeyMaker class for custom sorting. The id value in the document's data is the same as the id set for each document.

The behavior is as expected when searching only one database: doc.get_docid() == int(json.loads(doc.get_data())['id']).

When searching more than one database, doc.get_docid() returns a value that is not the same as int(json.loads(doc.get_data())['id']).

According to the docs: docid Xapian::Document::get_docid ( ) const

Get the document id which is associated with this document (if any). NB If multiple databases are being searched together, then this will be the document id in the individual database, not the merged database!

Here's my sample code and some output:

import xapian as x
import simplejson as json

db = x.Database()
db.add_database(x.Database('/var/xapian/db1.db')) #has XTYPA

q = x.Query('XTYPA')
q = x.Query(x.Query.OP_OR, q, x.Query('XTYPB')) 

class WhatsTheId(x.KeyMaker):
    def __init__(self):
        super(WhatsTheId, self).__init__()

    def __call__(self, doc):
        data = json.loads(doc.get_data())
        my_doc_id = data['id']
        if my_doc_id <= 10:
            # Print Xapian's docid next to the id and type stored in the document data.
            print doc.get_docid(), my_doc_id, data['type']
        # Constant sort key: this KeyMaker exists only to inspect each document.
        return x.sortable_serialise(1)

e = x.Enquire(db)
e.set_query(q)
e.set_sort_by_key(WhatsTheId()) 
e.get_mset(0, 1000000000, 0, None)

# Expected results

2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
9 9 A
10 10 A

db.add_database(x.Database('/var/xapian/db2.db')) #has XTYPB

e = x.Enquire(db)
e.set_query(q)
e.set_sort_by_key(WhatsTheId()) 
r = e.get_mset(0, 1000000000, 0, None)

# After adding another database: unexpected results

3 2 A
5 3 A
7 4 A
9 5 A
11 6 A
13 7 A
15 8 A
17 9 A
19 10 A
2 1 B
4 2 B

# It will consistently modify the internal get_docid value when adding more databases:

q = x.Query(x.Query.OP_OR, q, x.Query('XTYPC'))
db.add_database(x.Database('/var/xapian/db3.db')) #has XTYPC
e = x.Enquire(db)
e.set_query(q)
e.set_sort_by_key(WhatsTheId()) 
r = e.get_mset(0, 1000000000, 0, None)

4 2 A
7 3 A
10 4 A
13 5 A
16 6 A
19 7 A
22 8 A
25 9 A
28 10 A
2 1 B
5 2 B
3 1 C
6 2 C
9 3 C
12 4 C
15 5 C
18 6 C
21 7 C
24 8 C
27 9 C
30 10 C
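
The values printed above are consistent with docids being interleaved across the sub-databases: with N databases searched together, document d in the database at zero-based position i (in add_database() order) shows up as (d - 1) * N + i + 1. Below is a minimal sketch of that mapping, checked only against the numbers in this report. The function names are illustrative and not part of the Xapian API, and it assumes, as stated above, that each document's stored id equals its docid in its own database.

```python
def merged_docid(sub_docid, db_index, n_dbs):
    """Docid in the combined database for a document in one sub-database.

    db_index is the zero-based position of the sub-database in
    add_database() order; n_dbs is the number of sub-databases.
    """
    return (sub_docid - 1) * n_dbs + db_index + 1


def split_docid(docid, n_dbs):
    """Inverse mapping: return (sub_docid, db_index) for a merged docid."""
    quotient, db_index = divmod(docid - 1, n_dbs)
    return quotient + 1, db_index


# Values taken from the output in this report.
assert merged_docid(2, 0, 2) == 3     # type A, id 2, two databases
assert merged_docid(1, 1, 2) == 2     # type B, id 1, two databases
assert merged_docid(10, 2, 3) == 30   # type C, id 10, three databases
assert split_docid(28, 3) == (10, 0)  # back to type A, id 10
```

Under the same assumption, split_docid(doc.get_docid(), 3)[0] inside the KeyMaker would give back the per-database docid that the documentation describes.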

Change History (3)

comment:1 by Olly Betts, 11 years ago

Description: modified (diff)
Milestone: 1.3.2
Status: new → assigned

comment:2 by Olly Betts, 11 years ago

Component: Other → Library API
Milestone: 1.3.2 → 1.2.18

Fixed in trunk r17981 (building on r17979 and r17980). The code is actually simpler after this fix, and I think it fixes a bug in the old code (but I haven't yet tried to create a testcase which exercises it).

Needs backporting to 1.2.x.

comment:3 by Olly Betts, 11 years ago

Resolution: fixed
Status: assigned → closed

Turns out the old code was correct - one of the subdocid calculations was a bit oddly expressed, but completely equivalent to the correct version.

Backported in r17982, r17983 and r17984.
