.. Copyright (C) 2007 Olly Betts

========================================
Xapian 1.0 Term Indexing/Querying Scheme
========================================

.. contents:: Table of contents

Introduction
============

In Xapian 1.0, the default indexing scheme has been changed significantly, to address
lessons learned from observing the old scheme in real-world use.  This document
describes the new scheme, with references to differences from the old.

Stemming
========

The most obvious difference is the handling of stemmed forms.

Previously all words were indexed stemmed without a prefix, and capitalised words were
also indexed unstemmed (but lowercased) with an 'R' prefix.  The rationale for doing this
was that people want to be able to search for exact proper nouns (e.g. the English stemmer
conflates ``Tony`` and ``Toni``).  But of course this also caught words at the start
of sentences and words in titles, and in German all nouns are capitalised, so all of them
were indexed this way too.  Both the normal and R-prefixed terms were indexed with
positional information.

Now we index all words lowercased with positional information, and also stemmed with a
'Z' prefix (unless they start with a digit), but without positional information.  By default
a ``Xapian::Stopper`` is used to avoid indexing stemmed forms of stopwords (tests show this
shaves around 1% off the database size).

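As a rough illustration of the scheme just described, the following Python sketch
generates both kinds of term for a piece of text.  It does not use the Xapian API:
``toy_stem``, ``STOPWORDS`` and ``index_text`` are hypothetical stand-ins, and the
word splitting is far cruder than what a real indexer does.

```python
import re

# Hypothetical stopword list; a real indexer would use a fuller one.
STOPWORDS = {"the", "a", "of", "and"}


def toy_stem(word):
    """Toy stemmer (an assumption for this sketch): strip one trailing 's'."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def index_text(text):
    """Return (positional, stemmed) for the given text.

    positional maps each lowercased word to its 1-based word positions.
    stemmed is the set of Z-prefixed stems, kept without positions.
    """
    positional = {}
    stemmed = set()
    for pos, word in enumerate(re.findall(r"\w+", text), start=1):
        term = word.lower()
        positional.setdefault(term, []).append(pos)
        # No stemmed form for words starting with a digit, or for stopwords.
        if term[0].isdigit() or term in STOPWORDS:
            continue
        stemmed.add("Z" + toy_stem(term))
    return positional, stemmed
```

Note that the stopword ``the`` still gets an unstemmed positional term in this sketch;
only its stemmed form is skipped.
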
The new scheme allows exact phrase searching (which the old scheme didn't).  ``NEAR``
now has to operate on unstemmed forms, but that's reasonable enough.  We can also disable
stemming of words which are capitalised in the query, to achieve good results for
proper nouns.  And Omega's ``$topterms`` will now always suggest unstemmed forms!

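The query-side handling of capitalised words can be sketched in the same spirit.  This
is a simplification under stated assumptions, not the QueryParser's actual logic:
``query_terms`` and ``toy_stem`` are hypothetical names, and the query is split on
whitespace only.

```python
def toy_stem(word):
    """Toy stemmer (an assumption for this sketch): strip one trailing 's'."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def query_terms(query):
    """Map each query word to the term that would be searched for."""
    terms = []
    for word in query.split():
        if word[0].isupper():
            # Capitalised in the query: search the unstemmed, lowercased form.
            terms.append(word.lower())
        else:
            # Otherwise search the Z-prefixed stemmed form.
            terms.append("Z" + toy_stem(word.lower()))
    return terms
```

For example, ``query_terms("Tony likes cats")`` yields one unstemmed term for the
proper noun and Z-prefixed stems for the other two words.
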
The main rationale for prefixing the stemmed forms (rather than the unstemmed ones) is
that there are simply fewer of them!  As a side benefit, it opens the way for storing
stemmed forms for multiple languages (e.g. ``Z:en:``, ``Z:fr:`` or something like that).

The special handling of a trailing ``.`` in the QueryParser (which would often
mistakenly trigger for pasted text) has been removed.  This feature was there to
support Omega's ``$topterms`` adding stemmed forms, but Omega no longer needs to do this
as it can suggest unstemmed forms instead.

Word Characters
===============

By default, Unicode characters of category CONNECTOR_PUNCTUATION (``_`` and a
handful of others) are now word characters, which provides better indexing of
identifiers, without much degradation of other cases.  Previously cases like
``time_t`` required a phrase search.

Trailing ``+`` and ``#`` are still included on terms (up to 3 characters), but
``-`` no longer is by default.  The examples it benefits aren't compelling
(``nethack--``, ``Cl-``) and including it tends to glue hyphens onto terms.

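A minimal sketch of the trailing ``+`` and ``#`` rule follows.  The names ``TOKEN``,
``MAX_SUFFIX`` and ``terms_with_suffix`` are hypothetical, and the treatment of a
suffix longer than 3 characters (dropped entirely here) is an assumption rather than
something this document specifies.

```python
import re

MAX_SUFFIX = 3  # from the text: at most 3 trailing characters are kept

# A word followed by an optional run of '+' or '#'.
TOKEN = re.compile(r"(\w+)([+#]*)")


def terms_with_suffix(text):
    """Return lowercased terms, keeping short trailing runs of '+' or '#'."""
    terms = []
    for word, suffix in TOKEN.findall(text):
        # Assumption: an over-long suffix is dropped rather than truncated.
        if len(suffix) > MAX_SUFFIX:
            suffix = ""
        terms.append(word.lower() + suffix)
    return terms
```

With this sketch ``C++`` and ``C#`` keep their suffixes, while the hyphens in
``nethack--`` are not part of the term.
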
A single embedded ``'`` (apostrophe) is now included in a term.
Previously this caused a slow phrase search, and added junk terms to the index
(``didn't`` became ``didn`` and ``t``, etc.).  Various Unicode characters used for
apostrophes are all mapped to the ASCII representation.

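The apostrophe handling might look something like the sketch below.  The two
characters in ``APOSTROPHES`` are only examples of Unicode apostrophe-like characters;
the actual set that gets mapped is an assumption here, as are the names ``TOKEN`` and
``tokenize``.

```python
import re

# Assumption: example Unicode apostrophes; the real list may differ.
APOSTROPHES = {"\u2019": "'", "\u201b": "'"}

# Runs of word characters, joined by a single apostrophe only when it
# sits directly between two word characters.
TOKEN = re.compile(r"\w+(?:'\w+)*")


def tokenize(text):
    """Normalise apostrophes to ASCII, then split into lowercased terms."""
    for uni, plain in APOSTROPHES.items():
        text = text.replace(uni, plain)
    return [t.lower() for t in TOKEN.findall(text)]
```

A trailing apostrophe (as in ``dogs'``) or a doubled one is not embedded between two
word characters, so it is left out of the term.
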
A few other characters (taken from the Unicode definition of a word) are included
in terms if they occur between two word characters, and ``.``, ``,`` and a
few others are included in terms if they occur between two decimal digit characters.
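
The digit rule can be sketched with a regular expression.  Only ``.`` and ``,`` are
handled here; the "few others" mentioned above are left out, and ``TOKEN`` and
``tokenize_numbers`` are hypothetical names.

```python
import re

# Word characters, optionally extended by '.' or ',' when the separator
# has a decimal digit immediately before and after it.
TOKEN = re.compile(r"\w+(?:(?<=\d)[.,](?=\d)\w+)*")


def tokenize_numbers(text):
    """Split text into lowercased terms, keeping digit separators."""
    return [t.lower() for t in TOKEN.findall(text)]
```

In this sketch ``1,000.50`` stays a single term, whereas ``v1.x`` is split because the
``.`` is not followed by a digit.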