How can I implement a "find documents like this one" feature?

There are (at least) a couple of ways to do this.

One is to use Query::OP_ELITE_SET. Give it all the terms from the document you want to find similar documents to, and it will pick the best N and combine them into an "OR" query. OP_ELITE_SET can be combined with other query operators.

You should probably think of "best" as defined by outcome rather than anything else, but currently it picks the terms with the highest maximum termweight (as reported by the current weighting scheme). The aim is to try to pick terms which have a good discriminating power in the index and to drop stopwords (e.g. "and", "the", "a", etc) and other common terms. For example, when searching the Xapian website the term "xapian" isn't very interesting because it features on almost every page.
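To make the idea of "pick the highest-weighted terms" concrete, here is a toy sketch in plain C++ that does not use Xapian at all. It assumes a simple IDF-style weight as a stand-in for whatever the real weighting scheme reports, and every name in it is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical input: a term and the number of documents it occurs in.
struct TermStat {
    std::string term;
    std::size_t doc_freq;
};

// Toy stand-in for a term weight: rare terms score high, terms found in
// nearly every document score close to zero. Not Xapian's actual formula.
double toy_weight(std::size_t doc_freq, std::size_t total_docs)
{
    return std::log((total_docs + 1.0) / (doc_freq + 1.0));
}

// Return the n highest-weighted terms, best first.
std::vector<std::string>
pick_elite_terms(std::vector<TermStat> stats, std::size_t total_docs,
                 std::size_t n)
{
    n = std::min(n, stats.size());
    std::partial_sort(stats.begin(), stats.begin() + n, stats.end(),
        [total_docs](const TermStat& a, const TermStat& b) {
            return toy_weight(a.doc_freq, total_docs) >
                   toy_weight(b.doc_freq, total_docs);
        });
    std::vector<std::string> elite;
    for (std::size_t i = 0; i < n; ++i) elite.push_back(stats[i].term);
    return elite;
}
```

With 1000 documents and the terms "xapian" (in 990), "the" (in 1000), "stemmer" (in 12) and "bm25" (in 3), asking for the best two returns "bm25" and "stemmer". The near-universal terms are dropped.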

The terms can either come from a document which is in a Xapian database, or from parsing some text, so OP_ELITE_SET allows "find more like" to work on an arbitrary piece of text. In fact it was originally added to allow implementation of a desktop tool onto which you could drag documents to launch a search for similar documents.

The other approach only works for finding documents similar to a document in a database and involves marking the given document as relevant and generating an ESet:

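    // Assumes db is an open Xapian::Database.
    Xapian::Enquire enquire(db);

    // Mark the source document (docid 1 here) as relevant.
    Xapian::RSet rset;
    rset.add_document(1);
    // Ask for up to 40 terms which best characterise the relevant set.
    Xapian::ESet eset = enquire.get_eset(40, rset);

    // OR the suggested terms together and use that as the query.
    Xapian::Query query(Xapian::Query::OP_OR, eset.begin(), eset.end());

    enquire.set_query(query);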

You may want to use an ExpandDecider subclass to reject some or all prefixed terms when creating the ESet.

This approach extends naturally to finding documents like a given set of documents - just add all the given documents to the RSet.

This is how Omega's "MORELIKE" functionality is currently implemented.


Last modified on 07/03/08 03:24:16