How do I use unique ids with Xapian?
Often the documents which you're indexing with Xapian will already have unique ids which you want to be able to use to reindex an updated version of an existing document, or delete an expired document from the Xapian index.
If these unique ids are positive integers and contiguous, or without many gaps, then you can just use them as the document ids in the Xapian index (see below for discussion of this option). If the unique ids in the other system are non-numeric, or numeric but sparse, then the solution is to add the unique id to each document as a term, prefixed with some prefix you're not using for anything else. By convention, the prefix "Q" is used for this purpose.
For example, if the unique id for a particular document is "65A", you'd add a term "Q65A" to the document (before adding it to the database) with doc.add_term("Q65A"). To update the document with unique id "65A", use db.replace_document("Q65A", doc) and to delete it use db.delete_document("Q65A").
If you want to retrieve the document, open the postlist for "Q65A" and read the first (and only) document id (this will be a little more efficient than running a search for "Q65A"). In C++, you can do this like so:
Xapian::PostingIterator p = db.postlist_begin("Q65A");
if (p == db.postlist_end("Q65A")) {
    cout << "No document with id 65A" << endl;
} else {
    cout << "Document with id 65A has docid " << *p << endl;
}
Using the external unique id as the Xapian docid
If the ids are positive integers without a lot of big gaps, you can just use the unique id as the Xapian document id.
For example, unique ids are often allocated from an integer counter which is incremented for each newly allocated id. There may be gaps for deleted documents, but the ids aren't sparse. This is a suitable situation for this technique.
When adding a document, use db.replace_document(unique_id, doc) rather than db.add_document(doc) - if there's no document with unique_id yet, replace_document will just create one. The same code is used for updating an existing document.
To remove a document, use db.delete_document(unique_id). And to retrieve the document with a particular id, use db.get_document(unique_id).
To get the unique id for documents matching a search, iterate over them and use MSetIterator::get_docid() (in C++, dereference the iterator instead: *i gives the docid).
When indexing, Xapian can work most efficiently when each new document has a higher document id than any in the database already. So when indexing a lot of documents, try to arrange this if possible.
