| 1 | |
|---|
| 2 | .. Copyright (C) 2007 Olly Betts |
|---|
| 3 | |
|---|
| 4 | ========================= |
|---|
| 5 | Sorting of Search Results |
|---|
| 6 | ========================= |
|---|
| 7 | |
|---|
| 8 | .. contents:: Table of contents |
|---|
| 9 | |
|---|
| 10 | Introduction |
|---|
| 11 | ============ |
|---|
| 12 | |
|---|
| 13 | By default, Xapian orders search results by decreasing relevance score. |
|---|
| 14 | However, it also allows results to be ordered by other criteria, or |
|---|
| 15 | a mixture of other criteria and relevance score. |
|---|
| 16 | |
|---|
| 17 | If two or more results compare equal by the sorting criteria, then their order |
|---|
| 18 | is decided by their document ids. By default, the document ids sort in |
|---|
| 19 | ascending order (so a lower document id is "better"), but this can be set |
|---|
| 20 | to descending using ``enquire.set_docid_order(enquire.DESCENDING);``. If you |
|---|
| 21 | have no preference, you can tell Xapian to use whatever order is most efficient |
|---|
| 22 | using ``enquire.set_docid_order(enquire.DONT_CARE);``. |
|---|
| 23 | |
|---|
| 24 | Sorting by Relevance |
|---|
| 25 | ==================== |
|---|
| 26 | |
|---|
| 27 | The BM25 weighting formula which Xapian uses by default has a number of |
|---|
| 28 | parameters. We have picked some default parameter values which do a good job |
|---|
| 29 | in general. The optimal values of these parameters depend on the data being |
|---|
| 30 | indexed and the type of queries being run, so you may be able to improve the |
|---|
| 31 | effectiveness of your search system by adjusting these values, but it's a |
|---|
| 32 | fiddly process to tune them so people tend not to bother. |
|---|
| 33 | |
|---|
| 34 | See the `BM25 documentation <bm25.html>`_ for more details of BM25. |
|---|
| 35 | |
|---|
| 36 | The other included weighting schemes are ``TradWeight`` and ``BoolWeight``. |
|---|
| 37 | |
|---|
| 38 | TradWeight implements the original probabilistic weighting formula, which is |
|---|
| 39 | essentially a special case of BM25 (it's BM25 with k2 = 0, k3 = 0, b = 1, and |
|---|
| 40 | min_normlen = 0, except that the weights are scaled by a constant factor). |
|---|
| 41 | |
|---|
| 42 | BoolWeight assigns a weight of 0 to all documents, so the ordering is |
|---|
| 43 | determined solely by other factors. |
|---|
| 44 | |
|---|
| 45 | You can also implement your own weighting scheme, provided it can be expressed |
|---|
| 46 | in the form of a sum over the matching terms, plus an extra term which depends |
|---|
| 47 | on term-independent statistics (such as the normalised document length). |
|---|
| 48 | |
|---|
| 49 | For example, here's an implementation of "coordinate matching" - each matching |
|---|
| 50 | term scores one point:: |
|---|
| 51 | |
|---|
| 52 | class CoordinateWeight : public Xapian::Weight { |
|---|
| 53 | public: |
|---|
| 54 | CoordinateWeight * clone() const { return new CoordinateWeight; } |
|---|
| 55 | CoordinateWeight() { } |
|---|
| 56 | ~CoordinateWeight() { } |
|---|
| 57 | |
|---|
| 58 | std::string name() const { return "Coord"; } |
|---|
| 59 | std::string serialise() const { return ""; } |
|---|
| 60 | CoordinateWeight * unserialise(const std::string &) const { |
|---|
| 61 | return new CoordinateWeight; |
|---|
| 62 | } |
|---|
| 63 | |
|---|
| 64 | Xapian::weight get_sumpart(Xapian::termcount, Xapian::doclength) const { |
|---|
| 65 | return 1; |
|---|
| 66 | } |
|---|
| 67 | Xapian::weight get_maxpart() const { return 1; } |
|---|
| 68 | |
|---|
| 69 | Xapian::weight get_sumextra(Xapian::doclength) const { return 0; } |
|---|
| 70 | Xapian::weight get_maxextra() const { return 0; } |
|---|
| 71 | |
|---|
| 72 | bool get_sumpart_needs_doclength() const { return false; } |
|---|
| 73 | }; |
|---|
| 74 | |
|---|
| 75 | .. FIXME: add a more complex example once user-defined weight classes can |
|---|
| 76 | see the statistics. |
|---|
| 77 | |
|---|
| 78 | Sorting by Other Properties |
|---|
| 79 | =========================== |
|---|
| 80 | |
|---|
| 81 | If you want to offer a "sort by date" feature, and can arrange for documents to |
|---|
| 82 | be indexed in date order (or a close-enough approximation), then you can |
|---|
| 83 | implement a very efficient "sort by date" feature by using a boolean search |
|---|
| 84 | (i.e. call ``enquire.set_weighting_scheme(Xapian::BoolWeight());``) with |
|---|
| 85 | ``enquire.set_docid_order(Xapian::Enquire::DESCENDING);`` (for newest first) or |
|---|
| 86 | ``enquire.set_docid_order(Xapian::Enquire::ASCENDING);`` (for oldest first). |
|---|
| 87 | There's no inherent reason why this technique can't be used for sorting by |
|---|
| 88 | something other than date, but it's usually much easier to arrange for new |
|---|
| 89 | documents to arrive in date order than in other orders. |
|---|
| 90 | |
|---|
| 91 | Sorting by Value |
|---|
| 92 | ---------------- |
|---|
| 93 | |
|---|
| 94 | You can order documents by comparing a specified document value. Note that the |
|---|
| 95 | comparison used compares the byte values in the value (i.e. it's a string sort |
|---|
| 96 | ignoring locale), so ``1`` < ``10`` < ``2``. If you want to encode the value |
|---|
| 97 | such that it sorts numerically, use ``Xapian::sortable_serialise()`` to encode |
|---|
| 98 | values at index time - this works equally will on integers and floating point |
|---|
| 99 | values:: |
|---|
| 100 | |
|---|
| 101 | Xapian::Document doc; |
|---|
| 102 | doc.add_value(0, Xapian::sortable_serialise(price)); |
|---|
| 103 | |
|---|
| 104 | There are three methods which are used to specify how the value is used to |
|---|
| 105 | sort, depending if/how you want relevance used in the ordering: |
|---|
| 106 | |
|---|
| 107 | * ``Enquire::set_sort_by_value()`` specifies the relevance doesn't affect the |
|---|
| 108 | ordering at all. |
|---|
| 109 | * ``Enquire::set_sort_by_value_then_relevance()`` specifies that relevance is |
|---|
| 110 | used for ordering any groups of documents for which the value is the same. |
|---|
| 111 | * ``Enquire::set_sort_by_relevance_then_value()`` specifies that documents are |
|---|
| 112 | ordered by relevance, and the value is only used to order groups of documents |
|---|
| 113 | with identical relevance values (note: the weight has to be the same, not |
|---|
| 114 | just the rounded percentage score). This method isn't very useful with the |
|---|
| 115 | default BM25 weighting, since it rarely assigns identical scores to |
|---|
| 116 | different documents. |
|---|
| 117 | |
|---|
| 118 | Sorting by Generated Key |
|---|
| 119 | ------------------------ |
|---|
| 120 | |
|---|
| 121 | To allow more elaborate sorting schemes, Xapian allows you to provide a functor |
|---|
| 122 | object subclassed from ``Xapian::Sorter`` which generates a sort key for each |
|---|
| 123 | matching document which is under consideration. This is called at most once |
|---|
| 124 | for each document, and then the generated sort keys are ordered by comparing |
|---|
| 125 | byte values (i.e. with a string sort ignoring locale). |
|---|
| 126 | |
|---|
| 127 | There's a standard subclass ``Xapian::MultiValueSorter`` which allows sorting |
|---|
| 128 | on more than one document value (so the first document value specified |
|---|
| 129 | determines the order except among groups which have the same value, when |
|---|
| 130 | the second document value specified is used, and so on). |
|---|
| 131 | |
|---|
| 132 | ``Xapian::Sorter`` can also be subclassed to offer features such as "sort by |
|---|
| 133 | geographical distance". A subclass could take a coordinate pair - e.g. |
|---|
| 134 | (latitude, longitude) - for the user's location and sort results using |
|---|
| 135 | coordinates stored in a document value so that the nearest results ranked |
|---|
| 136 | highest. |
|---|