| 1 | |
|---|
| 2 | .. Copyright (C) 2007 Olly Betts |
|---|
| 3 | |
|---|
| 4 | ============================= |
|---|
| 5 | Xapian Categorisation Support |
|---|
| 6 | ============================= |
|---|
| 7 | |
|---|
| 8 | .. contents:: Table of contents |
|---|
| 9 | |
|---|
| 10 | Introduction |
|---|
| 11 | ============ |
|---|
| 12 | |
|---|
| 13 | Xapian provides functionality which allows you to dynamically generate complete |
|---|
| 14 | lists of category values which feature in matching documents. There are |
|---|
| 15 | numerous potential uses this can be put to, but a common one is to offer the |
|---|
| 16 | user the ability to narrow down their search by filtering it to only include |
|---|
| 17 | documents with a particular value of a particular category. |
|---|
| 18 | |
|---|
| 19 | Some categories are numeric and can take many different values (examples |
|---|
| 20 | include price, width, and height). The number of different values will often |
|---|
| 21 | be overwhelming, and users will generally be more interested in narrowing their |
|---|
| 22 | search to a range rather than a single value. For these, Xapian can group the |
|---|
| 23 | results into ranges for you. |
|---|
| 24 | |
|---|
| 25 | In some applications, you may have many different categories (for example |
|---|
| 26 | colour, price, width, height) but not always want to offer all of them |
|---|
| 27 | for every search. If all the results are red, and none have width, it's |
|---|
| 28 | not useful to offer to narrow the search by colour or width. Also, the |
|---|
| 29 | user interface may not have room to include every category, so you may |
|---|
| 30 | want to select the "best" few categories to show the user. |
|---|
| 31 | |
|---|
| 32 | How to make use of the categorisation functionality |
|---|
| 33 | =================================================== |
|---|
| 34 | |
|---|
| 35 | Indexing |
|---|
| 36 | -------- |
|---|
| 37 | |
|---|
| 38 | When indexing a document, you need to add each category in a different |
|---|
| 39 | number value slot. For numeric values which you want to be able to |
|---|
| 40 | group, you should encode the numeric value as a string using |
|---|
| 41 | ``Xapian::sortable_serialise()``. |
|---|
| 42 | |
|---|
| 43 | Searching |
|---|
| 44 | --------- |
|---|
| 45 | |
|---|
| 46 | At search time, you need to pass a ``Xapian::MatchSpy`` object to |
|---|
| 47 | ``Xapian::Enquire::get_mset()``, like so:: |
|---|
| 48 | |
|---|
| 49 | Xapian::MatchSpy spy; |
|---|
| 50 | |
|---|
| 51 | spy.add_category(0); |
|---|
| 52 | spy.add_category(1); |
|---|
| 53 | spy.add_category(3); |
|---|
| 54 | |
|---|
| 55 | Xapian::Enquire enq(db); |
|---|
| 56 | |
|---|
| 57 | enq.set_query(query); |
|---|
| 58 | |
|---|
| 59 | Xapian::MSet mset = enq.get_mset(0, 10, 10000, NULL, NULL, &spy); |
|---|
| 60 | |
|---|
| 61 | The ``10000`` in the call to ``get_mset`` tells Xapian to check at least |
|---|
| 62 | 10000 documents, so the ``spy`` object will be passed at least 10000 documents |
|---|
| 63 | to tally category information from (unless less than 10000 documents match |
|---|
| 64 | the query, in which case it will see all of them). Setting this higher will |
|---|
| 65 | make the counts exact, but Xapian will have to do more work for most queries |
|---|
| 66 | so searches will be slower. |
|---|
| 67 | |
|---|
| 68 | The ``spy`` object now contains the category information. You can find out |
|---|
| 69 | how many documents it looked at by calling ``spy.get_total()``. You can |
|---|
| 70 | read the values for category ``cat_no`` like this:: |
|---|
| 71 | |
|---|
| 72 | const map<string, size_t> & cat = spy.get_categories(cat_no); |
|---|
| 73 | map<string, size_t>::const_iterator i; |
|---|
| 74 | for (i = cat.begin(); i != cat.end(); ++i) { |
|---|
| 75 | cout << i->first << ": " << i->second << endl; |
|---|
| 76 | } |
|---|
| 77 | |
|---|
| 78 | You calculate the score for category ``cat_no`` like so:: |
|---|
| 79 | |
|---|
| 80 | double score = spy.score_categorisation(cat_num); |
|---|
| 81 | |
|---|
| 82 | Or if you prefer categories with 4 or 5 values:: |
|---|
| 83 | |
|---|
| 84 | double score = spy.score_categorisation(cat_num, 4.5); |
|---|
| 85 | |
|---|
| 86 | The smaller the score, the better - a perfectly even split with exactly the |
|---|
| 87 | number of entries asked (or with no preference given for the number of entries) |
|---|
| 88 | scores 0. You should experiment to find a suitable threshold for your |
|---|
| 89 | application, but to give you a rough idea, a suitable threshold is likely to be |
|---|
| 90 | less than one. |
|---|
| 91 | |
|---|
| 92 | The scoring uses a sum of squared differences (currently that is - this should |
|---|
| 93 | probably be regarded as an implementation detail which could change in the |
|---|
| 94 | future if we find a better algorithm). |
|---|
| 95 | |
|---|
| 96 | You would build ranges from numeric values for value ``cat_no``, asking for at |
|---|
| 97 | most ``num_ranges`` ranges like so:: |
|---|
| 98 | |
|---|
| 99 | bool result = spy.build_numeric_ranges(cat_no, num_ranges); |
|---|
| 100 | |
|---|
| 101 | If ranges could not be built (for example, because all documents have the |
|---|
| 102 | same value for ``cat_no``), ``false`` is returned. Otherwise ``true`` is |
|---|
| 103 | returned, and the spy object's category map for value ``cat_no`` is modified |
|---|
| 104 | to consist of ranges. Keys are now built of strings returned by |
|---|
| 105 | ``Xapian::sortable_serialise()`` - either a single string if there is only |
|---|
| 106 | one number in a particular range, or for a range a string padded to 9 bytes |
|---|
| 107 | with zero bytes, with a second string appended. |
|---|
| 108 | |
|---|
| 109 | Restricting by category values |
|---|
| 110 | ------------------------------ |
|---|
| 111 | |
|---|
| 112 | If you're using the categorisation to offer the user choices for narrowing |
|---|
| 113 | down their search results, you then need to be able to apply a suitable |
|---|
| 114 | filter. |
|---|
| 115 | |
|---|
| 116 | For a range, the best way is to use ``Xapian::Query::OP_VALUE_RANGE`` to |
|---|
| 117 | build a filter query, and then combine this with the user's query using |
|---|
| 118 | ``Xapian::Query::OP_FILTER``. |
|---|
| 119 | |
|---|
| 120 | For a single value, you could use ``Xapian::Query::OP_VALUE_RANGE`` with |
|---|
| 121 | the same start and end, or ``Xapian::MatchDecider``, but it's probably |
|---|
| 122 | most efficient to also index the categories as suitably prefixed boolean |
|---|
| 123 | terms and use those for filtering. |
|---|
| 124 | |
|---|
| 125 | Current Limitations |
|---|
| 126 | =================== |
|---|
| 127 | |
|---|
| 128 | It's not currently possible to build logarithmic ranges without writing |
|---|
| 129 | your own subclass. |
|---|
| 130 | |
|---|
| 131 | It's not possible to try building different ranges because the original |
|---|
| 132 | data is overwritten. If it's actually useful to do this, the API needs |
|---|
| 133 | adjusting. |
|---|