.. Revision 9640, 3.2 kB (checked in by richard, 14 months ago)
   docs/termgenerator.rst: Change a couple of instances of "terms"
   to "words" for clarity; the item from the input text is a word,
   and the result of processing is a term.

.. Copyright (C) 2007 Olly Betts

========================================
Xapian 1.0 Term Indexing/Querying Scheme
========================================

.. contents:: Table of contents

Introduction
============

In Xapian 1.0, the default indexing scheme has been changed significantly, to address
lessons learned from observing the old scheme in real-world use.  This document
describes the new scheme, with references to differences from the old.

Stemming
========

The most obvious difference is the handling of stemmed forms.

Previously all words were indexed stemmed without a prefix, and capitalised words were
also indexed unstemmed (but lowercased) with an 'R' prefix.  The rationale for doing this was
that people want to be able to search for exact proper nouns (e.g. the English stemmer
conflates ``Tony`` and ``Toni``).  But of course this also indexes words at the start
of sentences and words in titles, and in German all nouns are capitalised, so all of them
get an R-prefixed term.  Both the normal and R-prefixed terms were indexed with positional
information.

Now we index all words lowercased with positional information, and also stemmed with a
'Z' prefix (unless they start with a digit), but without positional information.  By default
a ``Xapian::Stopper`` is used to avoid indexing stemmed forms of stopwords (tests show this
shaves around 1% off the database size).

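
As a rough illustration of the indexing side, the following Python sketch
generates both kinds of term for a piece of text.  It is not Xapian's
implementation: ``toy_stem`` and ``STOPWORDS`` are hypothetical stand-ins for a
real stemmer and a ``Xapian::Stopper``, and splitting on ``\w+`` is a
simplification of the word rules described later in this document.

```python
import re

# Hypothetical stand-ins: a toy suffix stripper instead of a real stemmer,
# and a tiny stopword set instead of a real Xapian::Stopper.
STOPWORDS = {"the", "a", "of", "and"}


def toy_stem(word):
    """Crude illustrative stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_text(text):
    """Return (positional, stemmed) for the scheme described above.

    positional maps each lowercased word to its list of positions;
    stemmed is the set of 'Z'-prefixed terms, kept without positions.
    """
    positional = {}
    stemmed = set()
    for pos, word in enumerate(re.findall(r"\w+", text), start=1):
        term = word.lower()
        positional.setdefault(term, []).append(pos)
        # No stemmed form for words starting with a digit or for stopwords.
        if term[0].isdigit() or term in STOPWORDS:
            continue
        stemmed.add("Z" + toy_stem(term))
    return positional, stemmed


positional, stemmed = index_text("The 3rd indexing of the Terms")
```

Here ``positional`` records every word with its positions (including the
stopwords and ``3rd``), while ``stemmed`` only holds Z-prefixed forms for the
words that are neither stopwords nor start with a digit.
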
The new scheme allows exact phrase searching (which the old scheme didn't).  ``NEAR``
now has to operate on unstemmed forms, but that's reasonable enough.  We can also disable
stemming of words which are capitalised in the query, to achieve good results for
proper nouns.  And Omega's ``$topterms`` will now always suggest unstemmed forms!

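
The query side of this can be sketched in the same spirit.  The function below
is an assumption about how a query word might be mapped to a term under this
scheme, not the QueryParser's actual code: ``toy_stem`` is again a hypothetical
stemmer, and skipping stemming for words that start with a digit is inferred
from the indexing rule above.

```python
def toy_stem(word):
    """Crude illustrative stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def query_term(word):
    """Map one query word to the term it would be looked up as."""
    # Capitalised words are treated as proper nouns and searched unstemmed;
    # words starting with a digit have no stemmed form in the index.
    if word[0].isupper() or word[0].isdigit():
        return word.lower()
    return "Z" + toy_stem(word.lower())


examples = [query_term(w) for w in ("Tony", "indexing", "3rd")]
```

With this mapping a query for ``Tony`` looks up the unstemmed term ``tony``,
so it is not conflated with ``Toni``, while ``indexing`` is looked up via its
Z-prefixed stem.
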
The main rationale for prefixing the stemmed forms is that there are simply fewer of
them!  As a side benefit, it opens the way for storing stemmed forms for multiple
languages (e.g. ``Z:en:``, ``Z:fr:`` or something like that).

The special handling of a trailing ``.`` in the QueryParser (which would often
mistakenly trigger for pasted text) has been removed.  This feature was there to
support Omega's ``$topterms`` adding stemmed forms, but Omega no longer needs to do this
as it can suggest unstemmed forms instead.

Word Characters
===============

By default, Unicode characters of category CONNECTOR_PUNCTUATION (``_`` and a
handful of others) are now word characters, which provides better indexing of
identifiers, without much degradation of other cases.  Previously, cases like
``time_t`` required a phrase search.

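
The effect on identifiers can be seen with a small Python sketch.  It assumes
a simplified definition of a word character (alphanumeric, or Unicode category
``Pc``, which is Python's name for connector punctuation); the real definition
used by Xapian may cover more categories.

```python
import unicodedata


def is_word_char(ch):
    """Letters, digits, and connector punctuation count as word characters."""
    return ch.isalnum() or unicodedata.category(ch) == "Pc"


def split_words(text):
    """Split text into maximal runs of word characters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


words = split_words("cast to time_t first")
```

Because ``_`` is a word character here, ``time_t`` comes out as one word
rather than as ``time`` followed by ``t``.
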
Trailing ``+`` and ``#`` are still included on terms (up to 3 characters), but
a trailing ``-`` no longer is by default.  The examples that benefit from it aren't
compelling (``nethack--``, ``Cl-``) and it tends to glue stray hyphens onto terms.

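
A regular expression is enough to sketch the suffix rule.  This is only an
approximation: it uses ``\w+`` for the word itself, and it simply stops after
three suffix characters, since what happens to longer runs is not specified
here.

```python
import re

# A word followed by at most three trailing '+' or '#' characters.
TERM = re.compile(r"\w+[+#]{0,3}")


def terms(text):
    """Return lowercased terms, keeping short '+'/'#' suffixes."""
    return [m.group(0).lower() for m in TERM.finditer(text)]


result = terms("C++ and C# but not nethack--")
```

The trailing hyphens on ``nethack--`` are dropped, while ``C++`` and ``C#``
keep their suffixes.
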
A single embedded ``'`` (apostrophe) is now included in a term.
Previously this caused a slow phrase search, and added junk terms to the index
(``didn't`` -> ``didn`` and ``t``, etc).  Various Unicode characters used for apostrophes
are all mapped to the ASCII apostrophe.

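
The apostrophe handling might look roughly like the Python sketch below.  Two
things in it are assumptions: the set of Unicode code points folded to ``'``
is not listed in this document, so ``APOSTROPHE_LIKE`` is illustrative, and
"embedded" is taken to mean an apostrophe directly between two word
characters.

```python
import re

# Assumption: illustrative apostrophe-like code points; the exact set that
# gets mapped is not given in this document.
APOSTROPHE_LIKE = str.maketrans({"\u2019": "'", "\u201b": "'"})

# A word, optionally continued by an apostrophe that is followed by more
# word characters (so leading and trailing quotes are not included).
WORD = re.compile(r"\w+(?:'\w+)*")


def terms(text):
    """Return lowercased terms with embedded apostrophes kept."""
    normalised = text.translate(APOSTROPHE_LIKE)
    return [m.group(0).lower() for m in WORD.finditer(normalised)]


plain = terms("didn't stop")
curly = terms("Didn\u2019t")
quoted = terms("'quoted'")
```

Both spellings of ``didn't`` produce the same single term, and quotation
marks around a word do not end up in the term.
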
A few other characters (taken from the Unicode definition of a word) are included
in terms if they occur between two word characters, and ``.``, ``,`` and a
few others are included in terms if they occur between two decimal digit characters.
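
The digit rule can be sketched with lookarounds.  This Python example covers
only ``.`` and ``,`` between two decimal digits; the "few others" and the
characters allowed between arbitrary word characters are not enumerated here,
so they are left out rather than guessed.

```python
import re

# '.' or ',' is kept only when it sits directly between two decimal digits.
TERM = re.compile(r"\w+(?:(?<=\d)[.,](?=\d)\w+)*")


def terms(text):
    """Return lowercased terms, keeping '.' and ',' inside numbers."""
    return [m.group(0).lower() for m in TERM.finditer(text)]


result = terms("Version 1.0.8 costs 1,000 units. Done")
```

The version number and the thousands-separated figure each stay in one term,
while the full stop after ``units`` is not attached to anything.
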