How do I use unique ids with Xapian?
Often the documents which you're indexing with Xapian will already have unique ids which you want to be able to use to reindex an updated version of an existing document, or delete an expired document from the Xapian index.
Using the external unique id as the Xapian docid
If the ids are positive integers and contiguous (or without a lot of big gaps), you can just use the external unique id as the Xapian document id. Note that because Xapian document ids are 32-bit unsigned integers, this won't work if your external ids can be larger than about 4 billion, and you'll need to use the other approach below.
For example, unique ids are often allocated from an integer counter which is incremented for each new id. There may be gaps for deleted documents, or perhaps if a commit fails, but the ids aren't sparse. This is a suitable situation for this technique.
When adding a document, use db.replace_document(unique_id, doc) rather than db.add_document(doc). If there's no document with unique_id yet, replace_document will just create one. The same code is used for updating an existing document.
To remove a document, use db.delete_document(unique_id). And to retrieve the document with a particular id, use db.get_document(unique_id).
To get the unique id for documents matching a search, iterate over them and use MSetIterator::get_docid().
When indexing, Xapian can work most efficiently when each new document has a higher document id than any in the database already. So when indexing a lot of documents, try to arrange this if possible.
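The points above can be combined into a small helper. This is only a sketch: the name index_by_docid and the constant MAX_DOCID are made up for illustration, db is assumed to behave like a xapian.WritableDatabase, and items is assumed to be an iterable of (unique_id, document) pairs. It checks that each id fits in a 32-bit docid and writes in ascending id order.

```python
MAX_DOCID = 2**32 - 1  # docids are 32-bit unsigned integers, and 0 is not a valid docid


def index_by_docid(db, items):
    """Index (unique_id, doc) pairs, using each unique id as the Xapian docid.

    `db` is assumed to provide replace_document(docid, doc), as
    xapian.WritableDatabase does. The helper name is hypothetical.
    """
    # Sort so that documents are written in ascending docid order.
    ordered = sorted(items, key=lambda pair: pair[0])

    # Validate everything first so a bad id doesn't leave a half-finished batch.
    for unique_id, _ in ordered:
        if not 1 <= unique_id <= MAX_DOCID:
            raise ValueError("id %r cannot be used as a Xapian docid" % (unique_id,))

    for unique_id, doc in ordered:
        # replace_document creates the document if it doesn't exist yet.
        db.replace_document(unique_id, doc)
```

Sorting only helps within one batch. If the database already holds higher docids than the ones being written, the writes are still correct, just not in the most efficient order.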
Using a term for the external unique id
If the unique ids in the other system are non-numeric, or numeric but sparse, then the solution is to add the unique id to each document as a term, prefixed with some prefix you're not using for anything else. By convention, the prefix "Q" is used for this purpose.
For example, if the unique id for a particular document is "65A", you'd add a term "Q65A" to the document (before adding it to the database) with doc.add_term("Q65A"). To update the document with unique id "65A", use db.replace_document("Q65A", doc) and to delete it use db.delete_document("Q65A").
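As a rough sketch of how this might be wrapped up, the helpers below build the "Q"-prefixed term and make sure it is added to the document before the document is written. The names id_term, upsert and remove are made up for illustration; db is assumed to behave like a xapian.WritableDatabase and doc like a xapian.Document.

```python
ID_PREFIX = "Q"  # conventional prefix for external unique ids


def id_term(unique_id):
    """Build the unique id term, e.g. "65A" -> "Q65A"."""
    return ID_PREFIX + str(unique_id)


def upsert(db, unique_id, doc):
    """Add or update the document identified by `unique_id`."""
    term = id_term(unique_id)
    # The term has to be on the document itself, otherwise the next
    # update or delete by this term will not find the document.
    doc.add_term(term)
    return db.replace_document(term, doc)


def remove(db, unique_id):
    """Delete the document identified by `unique_id`, if there is one."""
    db.delete_document(id_term(unique_id))
```

Keeping the add_term() call inside the same helper as replace_document() guards against the easy mistake of replacing by a term which the new document doesn't itself contain.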
Note that get_document() cannot work with an external unique id term, so instead you need to open the postlist for "Q65A" and read the first (and only) document id (this will be a little more efficient than running a search for the term "Q65A"). In C++, you can do this like so:
    Xapian::PostingIterator p = db.postlist_begin("Q65A");
    if (p == db.postlist_end("Q65A")) {
        cout << "No document with id 65A" << endl;
    } else {
        cout << "Document with id 65A has docid " << *p << endl;
    }
Or in Python:
    postlist = db.postlist("Q65A")
    try:
        plitem = next(postlist)
    except StopIteration:
        raise KeyError("No document with id 65A")
    print("Document with id 65A has docid %d" % plitem.docid)
Or in PHP:
    $postlist = $db->postlist_begin('Q65A');
    if ($postlist->equals($db->postlist_end('Q65A')))
        echo "No document with ID 65A";
    else
        echo "Document with id 65A has docid ", $postlist->get_docid();
Working round the term length limit
There's a limit of 245 bytes on the length of terms. If your unique ids can be longer than this (e.g. URLs or filenames) then you'll have to split them up over several terms.
For example:
- if the path length is < 240, index and search using: "P" + path
- if the path length is >= 240 but < 480, index:
    doc.add_term("XA" + path.substr(0, 240)); doc.add_term("XB" + path.substr(240));
  You won't be able to use the term forms of replace_document() or delete_document() when updating the database with a document with such a term. Instead, search for those same terms ANDed together to find the document id, and then use that.
- if the path length is >= 480, do a similar thing but split over more terms...
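The splitting scheme in the list above could be sketched as follows. The function name path_terms and the constant MAX_CHUNK are made up for illustration, and continuing the prefixes as "XC", "XD" and so on for longer paths is an assumption, not an established convention. The sketch works on UTF-8 bytes because the term length limit is in bytes, and it assumes your bindings accept byte strings as terms.

```python
MAX_CHUNK = 240  # bytes of path per term, leaving room for the prefix


def path_terms(path):
    """Return the terms (as bytes) which together identify `path`."""
    data = path.encode("utf-8")  # the limit is in bytes, not characters
    if len(data) < MAX_CHUNK:
        return [b"P" + data]

    # A path of 240..479 bytes gives 2 chunks, 480..719 bytes gives 3, etc.
    # The final chunk may be empty when the length is an exact multiple of
    # MAX_CHUNK; keeping it means a path never shares its full set of terms
    # with a longer path that merely starts with it.
    n_chunks = len(data) // MAX_CHUNK + 1
    if n_chunks > 26:
        raise ValueError("path too long for prefixes XA..XZ")

    terms = []
    for n in range(n_chunks):
        prefix = b"X" + bytes([ord("A") + n])  # XA, XB, XC, ...
        terms.append(prefix + data[n * MAX_CHUNK:(n + 1) * MAX_CHUNK])
    return terms
```

At index time each returned term would be added to the document. To find the document later, the same list of terms would be ANDed together in a query.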
Alternatively, you could use a cryptographic hash (like SHA1), where the likelihood of a collision is low enough that it can be ignored. In this case, you could construct a "Q"-prefixed term as previously, with the rest of the term being a representation of the SHA1 hash.
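A minimal sketch of the hashing approach, using the hexadecimal form of the SHA1 digest as the representation (the function name hashed_id_term is made up for illustration, and hex is just one possible choice):

```python
import hashlib


def hashed_id_term(unique_id):
    """Map an arbitrarily long unique id to a fixed-length "Q" term."""
    digest = hashlib.sha1(unique_id.encode("utf-8")).hexdigest()
    # 1 byte of prefix + 40 hex digits = 41 bytes, well under the term limit.
    return "Q" + digest
```

Because the result is a single term, the term forms of replace_document() and delete_document() can be used with it just as with a short unique id.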