How can I implement a "find documents like this one" feature?
There are (at least) a couple of ways to do this.
One is to use Query::OP_ELITE_SET. You can give it all the terms from
the document you want to find more like and it will pick the best N and make
them into an "OR" query. OP_ELITE_SET can of course be combined with other
query operators.
You should probably think of "best" in terms of the outcome rather than any
fixed rule, but currently it picks the terms with the highest maximum term
weight (as reported by the current weighting scheme). The aim is to pick terms
which have good discriminating power in the index and to drop stopwords (e.g.
"and", "the", "a") and other common terms. For example, when searching the
Xapian website the term "xapian" isn't very interesting because it features on
almost every page.
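To make the selection idea concrete, here is a minimal standalone sketch of "keep the N highest-weighted terms". It is not Xapian's implementation: the `WeightedTerm` type, the `pick_elite_terms` name and the weights themselves are assumptions for illustration only. In Xapian the weights would come from the weighting scheme in use.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (term, weight) pair. The weight is supplied by the caller
// here; in Xapian it would be reported by the current weighting scheme.
using WeightedTerm = std::pair<std::string, double>;

// Return up to n terms with the highest weights, highest first.
// Low-weight terms (stopwords, very common terms) fall off the end.
std::vector<std::string> pick_elite_terms(std::vector<WeightedTerm> terms,
                                          std::size_t n) {
    n = std::min(n, terms.size());
    auto middle = terms.begin() + static_cast<std::ptrdiff_t>(n);
    std::partial_sort(terms.begin(), middle, terms.end(),
                      [](const WeightedTerm& a, const WeightedTerm& b) {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;  // deterministic tie-break
                      });
    std::vector<std::string> elite;
    elite.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        elite.push_back(terms[i].first);
    }
    return elite;
}
```

With made-up weights, a term such as "xapian" that appears on almost every page would get a low weight and be dropped in favour of rarer terms.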
The terms can either come from a document which is in a Xapian database,
or from parsing some text, so OP_ELITE_SET allows "find more like" to
work on an arbitrary piece of text. In fact it was originally added
to allow implementation of a desktop tool onto which you could drag
documents to launch a search for similar documents.
The other approach only works for finding documents similar to a document
in a database and involves marking the given document as relevant and
generating an ESet:
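    Xapian::Enquire enquire(db);
    // Mark the document to find more like (here docid 1) as relevant.
    Xapian::RSet rset;
    rset.add_document(1);
    // Ask for up to 40 expansion terms based on the relevant document.
    Xapian::ESet eset = enquire.get_eset(40, rset);
    // OR the suggested terms together and use that as the query.
    Xapian::Query query(Xapian::Query::OP_OR, eset.begin(), eset.end());
    enquire.set_query(query);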
You may want to use an ExpandDecider subclass to reject some or all prefixed
terms when creating the ESet.
This approach extends naturally to finding documents like a given set of
documents - just add all the given documents to the RSet.
This is how Omega's "MORELIKE" functionality is currently implemented.