.. Copyright (C) 2007 Olly Betts

=============================
Xapian Categorisation Support
=============================

.. contents:: Table of contents

Introduction
============

Xapian provides functionality which allows you to dynamically generate complete
lists of category values which feature in matching documents.  There are
numerous potential uses this can be put to, but a common one is to offer the
user the ability to narrow down their search by filtering it to only include
documents with a particular value of a particular category.

Some categories are numeric and can take many different values (examples
include price, width, and height).  The number of different values will often
be overwhelming, and users will generally be more interested in narrowing their
search to a range rather than a single value.  For these, Xapian can group the
results into ranges for you.
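
To give a rough idea of what grouping values into ranges involves, here is a
minimal standalone sketch.  It is not how Xapian builds its ranges: the helper
name ``bucket_values`` and the equal-width strategy are assumptions made purely
for illustration, and it works on plain ``double`` values rather than on values
stored in a Xapian database.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>

// Each range covers [first, second); the last one also includes its upper
// bound.
typedef std::pair<double, double> Range;

// Hypothetical helper (not a Xapian API): group per-value document counts
// into at most `num_ranges` equal-width ranges.  Returns an empty map when
// no ranges can be built (no values, or every value is the same).
std::map<Range, std::size_t>
bucket_values(const std::map<double, std::size_t>& counts,
              std::size_t num_ranges)
{
    std::map<Range, std::size_t> ranges;
    if (counts.empty() || num_ranges == 0) return ranges;

    const double lo = counts.begin()->first;
    const double hi = counts.rbegin()->first;
    if (lo == hi) return ranges;  // every document has the same value

    const double width = (hi - lo) / double(num_ranges);
    std::map<double, std::size_t>::const_iterator i;
    for (i = counts.begin(); i != counts.end(); ++i) {
        // The largest value would land one past the end, so clamp it.
        std::size_t idx = std::min(num_ranges - 1,
                                   std::size_t((i->first - lo) / width));
        Range r(lo + double(idx) * width, lo + double(idx + 1) * width);
        ranges[r] += i->second;  // empty ranges are simply never created
    }
    return ranges;
}
```

Equal-width ranges are the simplest choice; skewed data such as prices may be
better served by a different split.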

In some applications, you may have many different categories (for example
colour, price, width, height) but not always want to offer all of them
for every search.  If all the results are red, and none have width, it's
not useful to offer to narrow the search by colour or width.  Also, the
user interface may not have room to include every category, so you may
want to select the "best" few categories to show the user.

How to make use of the categorisation functionality
===================================================

Indexing
--------

When indexing a document, you need to add each category to its own numbered
value slot.  For numeric values which you want to be able to
group, you should encode the numeric value as a string using
``Xapian::sortable_serialise()``.

Searching
---------

At search time, you need to pass a ``Xapian::MatchSpy`` object to
``Xapian::Enquire::get_mset()``, like so::

    Xapian::MatchSpy spy;

    // Tally the values stored in slots 0, 1 and 3.
    spy.add_category(0);
    spy.add_category(1);
    spy.add_category(3);

    Xapian::Enquire enq(db);

    enq.set_query(query);

    // Return the top 10 matches, but check at least 10000 documents.
    Xapian::MSet mset = enq.get_mset(0, 10, 10000, NULL, NULL, &spy);

The ``10000`` in the call to ``get_mset`` tells Xapian to check at least
10000 documents, so the ``spy`` object will be passed at least 10000 documents
to tally category information from (unless fewer than 10000 documents match
the query, in which case it will see all of them).  Setting this higher will
make the counts more accurate (they are exact once every matching document is
checked), but Xapian will have to do more work for most queries so searches
will be slower.
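
The tallying described above can be modelled with a small standalone sketch.
This is not ``Xapian::MatchSpy`` itself: the ``TallySketch`` and ``SlotValues``
names are hypothetical, and a document is reduced to a plain map from value
slot to stored string.  It only shows why the counts reflect the documents
actually checked.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for a matching document: value slot -> stored value.
typedef std::map<unsigned, std::string> SlotValues;

// Illustrative model of the tallying only; NOT the real Xapian::MatchSpy.
struct TallySketch {
    std::set<unsigned> slots;  // slots registered for tallying
    std::map<unsigned, std::map<std::string, std::size_t> > counts;
    std::size_t total;         // number of documents seen so far

    TallySketch() : total(0) {}

    void add_category(unsigned slot) { slots.insert(slot); }

    // Called once for each document that is checked.
    void see(const SlotValues& doc) {
        ++total;
        std::set<unsigned>::const_iterator s;
        for (s = slots.begin(); s != slots.end(); ++s) {
            SlotValues::const_iterator v = doc.find(*s);
            // A document with no value in a slot only adds to `total`.
            if (v != doc.end() && !v->second.empty())
                ++counts[*s][v->second];
        }
    }
};
```

In this model a document that is never passed to ``see()`` contributes to
neither ``total`` nor any count, which is why checking more documents brings
the tallies closer to the true figures.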

The ``spy`` object now contains the category information.  You can find out
how many documents it looked at by calling ``spy.get_total()``.  You can
read the values for category ``cat_no`` like this::

    // Print each value seen in this category with its document count.
    const map<string, size_t> & cat = spy.get_categories(cat_no);
    map<string, size_t>::const_iterator i;
    for (i = cat.begin(); i != cat.end(); ++i) {
        cout << i->first << ": " << i->second << endl;
    }

You calculate the score for category ``cat_no`` like so::

    double score = spy.score_categorisation(cat_no);

Or if you prefer categories with 4 or 5 values::

    double score = spy.score_categorisation(cat_no, 4.5);

The smaller the score, the better: a perfectly even split with exactly the
number of entries asked for (or with no preference given for the number of
entries) scores 0.  You should experiment to find a suitable threshold for your
application, but to give you a rough idea, a suitable threshold is likely to be
less than one.
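
Once you have a score for each category, choosing which ones to show is a
matter of filtering and sorting.  The sketch below is one possible approach
and not part of Xapian: the ``pick_categories`` name, the ``threshold`` and
``max_shown`` parameters, and the idea of storing scores in a map keyed by
value slot are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper: given a score per value slot (lower is better),
// return up to `max_shown` slots scoring below `threshold`, best first.
std::vector<unsigned>
pick_categories(const std::map<unsigned, double>& scores,
                double threshold, std::size_t max_shown)
{
    // Collect (score, slot) pairs so that sorting orders by score first.
    std::vector<std::pair<double, unsigned> > ranked;
    std::map<unsigned, double>::const_iterator i;
    for (i = scores.begin(); i != scores.end(); ++i) {
        if (i->second < threshold)
            ranked.push_back(std::make_pair(i->second, i->first));
    }
    std::sort(ranked.begin(), ranked.end());
    if (ranked.size() > max_shown) ranked.resize(max_shown);

    std::vector<unsigned> slots;
    for (std::size_t j = 0; j < ranked.size(); ++j)
        slots.push_back(ranked[j].second);
    return slots;
}
```

The scores passed in would come from calling ``score_categorisation()`` once
for each slot you registered with the spy.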

The scoring currently uses a sum of squared differences, but this should be
regarded as an implementation detail which could change in the future if we
find a better algorithm.
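
To make the idea of a sum of squared differences concrete, here is one
hypothetical scoring function with the properties described above: lower is
better, and an even split with the requested number of entries scores 0.  It
should not be read as Xapian's actual formula.  The name ``score_sketch``, the
normalisation by the square of the total, and the ``0.01`` weight on the
entry-count mismatch are all assumptions.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical scoring sketch; NOT the formula Xapian uses.
// `cat` maps each value to its document count, `total` is the number of
// documents seen, and `desired` is the preferred number of entries
// (0 means no preference).
double score_sketch(const std::map<std::string, std::size_t>& cat,
                    std::size_t total, double desired = 0.0)
{
    if (total == 0 || cat.empty()) return 0.0;
    if (desired <= 0.0) desired = double(cat.size());

    // Count each entry would have in a perfectly even split.
    const double avg = double(total) / desired;

    double score = 0.0;
    std::map<std::string, std::size_t>::const_iterator i;
    for (i = cat.begin(); i != cat.end(); ++i) {
        const double diff = double(i->second) - avg;
        score += diff * diff;
    }
    // Scale so that the number of documents seen does not matter.
    score /= double(total) * double(total);

    // Assumed penalty for missing the preferred number of entries.
    const double miss = desired - double(cat.size());
    score += 0.01 * miss * miss;
    return score;
}
```

With this sketch a category split 5/5 over 10 documents scores 0, while a 9/1
split scores higher, so the even split would be preferred.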

You would build ranges from numeric values for value ``cat_no``, asking for at
most ``num_ranges`` ranges like so::

    bool result = spy.build_numeric_ranges(cat_no, num_ranges);

If ranges could not be built (for example, because all documents have the
same value for ``cat_no``), ``false`` is returned.  Otherwise ``true`` is
returned, and the spy object's category map for value ``cat_no`` is modified
to consist of ranges.  Keys are now built from strings returned by
``Xapian::sortable_serialise()``: a key is either a single such string, if
there is only one number in a particular range, or two such strings, the
first padded to 9 bytes with zero bytes and the second appended.
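
The key layout just described can be sketched with plain string handling.
The helpers below are hypothetical and not part of Xapian.  They assume that
each serialised string is at most 9 bytes long, that the first string does not
itself end in a zero byte, that any key longer than 9 bytes is a range, and
that the two strings are the lower and upper ends of the range; check these
assumptions against your Xapian version before relying on them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Width the first string is padded to, as described in the text.
const std::size_t PAD_WIDTH = 9;

// Hypothetical helper: build a range key from two already-serialised values.
std::string make_range_key(const std::string& lo, const std::string& hi)
{
    assert(lo.size() <= PAD_WIDTH);  // assumption: serialised values fit
    std::string key = lo;
    key.resize(PAD_WIDTH, '\0');     // pad with zero bytes
    key += hi;
    return key;
}

// Hypothetical helper: split a key back into its two serialised values.
// Returns false (leaving `lo` and `hi` untouched) when the key is short
// enough to be a single value.
bool split_range_key(const std::string& key,
                     std::string& lo, std::string& hi)
{
    if (key.size() <= PAD_WIDTH) return false;
    std::string first = key.substr(0, PAD_WIDTH);
    // Strip the zero-byte padding (assumes `lo` had no trailing zero bytes).
    while (!first.empty() && first[first.size() - 1] == '\0')
        first.erase(first.size() - 1);
    lo = first;
    hi = key.substr(PAD_WIDTH);
    return true;
}
```

Each half could then be turned back into a number with
``Xapian::sortable_unserialise()`` for display.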

Restricting by category values
------------------------------

If you're using the categorisation to offer the user choices for narrowing
down their search results, you then need to be able to apply a suitable
filter.

For a range, the best way is to use ``Xapian::Query::OP_VALUE_RANGE`` to
build a filter query, and then combine this with the user's query using
``Xapian::Query::OP_FILTER``.

For a single value, you could use ``Xapian::Query::OP_VALUE_RANGE`` with
the same start and end, or ``Xapian::MatchDecider``, but it's probably
most efficient to also index the categories as suitably prefixed boolean
terms and use those for filtering.

Current Limitations
===================

It's not currently possible to build logarithmic ranges without writing
your own subclass.

It's not possible to try building different ranges because the original
data is overwritten.  If it's actually useful to do this, the API needs
adjusting.