| 1 | .. Copyright (C) 2007 Jenny Black |
|---|
| 2 | .. Copyright (C) 2007 Olly Betts |
|---|
| 3 | .. Copyright (C) 2007 Deron Meranda |
|---|
| 4 | |
|---|
| 5 | ======== |
|---|
| 6 | Glossary |
|---|
| 7 | ======== |
|---|
| 8 | |
|---|
| 9 | This glossary defines specialized terminology you may encounter while using |
|---|
| 10 | Xapian. Some of the entries are standard in the field of Information |
|---|
| 11 | Retrieval, while others have a specific meaning in the context of Xapian. |
|---|
| 12 | |
|---|
| 13 | .. The first sentence should ideally work alone to allow us to reuse these |
|---|
| 14 | .. in the future to generate pop-up information when the user moves the mouse |
|---|
| 15 | .. over the term used in the documentation. |
|---|
| 16 | |
|---|
| 17 | **BM25** |
|---|
| 18 | The weighting scheme which Xapian uses by default. BM25 is a refinement of |
|---|
| 19 | the original probabilistic weighting scheme, and recent TREC tests have shown |
|---|
| 20 | BM25 to be the best of the known probabilistic weighting schemes. It's |
|---|
| 21 | sometimes known as "Okapi BM25" since it was first implemented in an |
|---|
| 22 | academic IR system called Okapi. |
|---|
| 23 | |
|---|
| 24 | **Boolean Retrieval** |
|---|
| 25 | Retrieving the set of documents that match a boolean query (e.g. a |
|---|
| 26 | list of terms joined with a combination of operators such as AND, OR, |
|---|
| 27 | AND_NOT). In many systems, these documents are not ranked according to their |
|---|
| 28 | relevance. In Xapian, a pure Boolean query may be used, or alternatively a |
|---|
| 29 | Boolean style query can filter the retrieved documents, which are then ordered |
|---|
| 30 | using a probabilistic ranking. |
|---|
| 31 | |
|---|
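As a rough illustration of pure Boolean matching, the sketch below uses Python sets as stand-ins for the lists of documents indexed by each term. The toy index, its terms, and the document IDs are all made up for this example, and this is not how Xapian implements matching internally.

```python
# Hypothetical toy index: term -> set of document IDs (not Xapian's storage).
index = {
    "apple": {1, 2, 4},
    "banana": {2, 3},
    "cherry": {4, 5},
}

def docs(term):
    """Return the set of document IDs indexed by term (empty if unknown)."""
    return index.get(term, set())

# apple AND banana: documents indexed by both terms.
and_result = docs("apple") & docs("banana")
# apple OR cherry: documents indexed by either term.
or_result = docs("apple") | docs("cherry")
# apple AND_NOT banana: documents indexed by apple but not by banana.
and_not_result = docs("apple") - docs("banana")

print(sorted(and_result))      # [2]
print(sorted(or_result))       # [1, 2, 4, 5]
print(sorted(and_not_result))  # [1, 4]
```

The results are plain sets with no ordering by relevance, which is the distinction from ranked retrieval drawn above.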
| 32 | **Database** |
|---|
| 33 | In Xapian (as opposed to a relational database system) a database consists of |
|---|
| 34 | little more than indexed documents: this reflects the purpose of Xapian as an |
|---|
| 35 | information retrieval system, rather than an information storage system. |
|---|
| 36 | Databases may also occasionally be called indexes. Flint is the backend used from |
|---|
| 37 | Xapian 1.0 onwards; Quartz was used in older versions. |
|---|
| 38 | |
|---|
| 39 | **Document ID** |
|---|
| 40 | A unique positive integer identifying a document in a Xapian database. |
|---|
| 41 | |
|---|
| 42 | **Document data** |
|---|
| 43 | The document data is one of several types of information that can be |
|---|
| 44 | associated with each document. The contents can be set to be anything in any |
|---|
| 45 | format; examples include fields such as URL, document title, and an excerpt of |
|---|
| 46 | text from the document. If you wish to interoperate with Omega, it should |
|---|
| 47 | contain name=value pairs, one per line (recent versions of Omega also support |
|---|
| 48 | one field value per line, and can assign names to line numbers in the |
|---|
| 49 | query template). |
|---|
| 50 | |
|---|
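The name=value layout described above is simple to read back. The following is a minimal sketch, assuming one pair per line; the helper name ``parse_document_data`` and the field names in the sample are illustrative choices, not part of any Xapian or Omega API.

```python
def parse_document_data(data):
    """Split name=value lines into a dict; lines without '=' are skipped."""
    fields = {}
    for line in data.splitlines():
        # partition() splits on the first '=' only, so values may contain '='.
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields

# Illustrative document data; the field names are assumptions for the example.
sample = "url=http://example.com/\ncaption=Example page\nsample=First words of the page"
fields = parse_document_data(sample)
print(fields["caption"])  # Example page
```

If a name appears more than once, this sketch keeps the last value; a real application would need to decide how to handle that case.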
| 51 | **Document** |
|---|
| 52 | These are the items that are being retrieved. Often they will be text |
|---|
| 53 | documents (e.g. web pages, email messages, word processor documents) |
|---|
| 54 | but they could be sections within such a document, or photos, video, music, |
|---|
| 55 | user profiles, or anything else you want to index. |
|---|
| 56 | |
|---|
| 57 | **Edit distance** |
|---|
| 58 | A measure of how many "edits" are required to turn one text string into |
|---|
| 59 | another, used to suggest spelling corrections. The algorithm Xapian uses |
|---|
| 60 | counts an edit as any of inserting a character, deleting a character, |
|---|
| 61 | changing a character, or transposing two adjacent characters. |
|---|
| 62 | |
|---|
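The four edit operations listed above can be sketched with a small dynamic-programming table. This is the textbook "optimal string alignment" variant, written for illustration only; Xapian's own implementation may differ in its details and is certainly more optimised.

```python
def edit_distance(a, b):
    """Count insertions, deletions, substitutions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the first i chars of a and first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]

print(edit_distance("xapian", "xapina"))  # 1 (one transposition)
print(edit_distance("serch", "search"))   # 1 (one insertion)
```

A spelling corrector would typically suggest candidate words whose distance from the misspelt word is small.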
| 63 | **ESet (Expand Set)** |
|---|
| 64 | The Expand Set (ESet) is a ranked list of terms that could be used to expand |
|---|
| 65 | the original query. These terms are those which are statistically good |
|---|
| 66 | differentiators between relevant and non-relevant documents. |
|---|
| 67 | |
|---|
| 68 | **Flint** |
|---|
| 69 | Flint is the current database format used in Xapian. It's the default from |
|---|
| 70 | Xapian 1.0 onwards, replacing Quartz. Flint is very efficient and highly |
|---|
| 71 | scalable. It supports incremental modifications, and concurrent single-writer |
|---|
| 72 | and multiple-reader access to a database. |
|---|
| 73 | |
|---|
| 74 | **Index** |
|---|
| 75 | If a document is described by a term, this term is said to index the document. |
|---|
| 76 | Also, the database in Xapian and other IR systems is sometimes called an index |
|---|
| 77 | (by analogy with the index in the back of a book). |
|---|
| 78 | |
|---|
| 79 | **Indexer** |
|---|
| 80 | The indexer takes documents (in various formats) and processes them so that they |
|---|
| 81 | can be searched efficiently; they are then stored in the database. |
|---|
| 82 | |
|---|
| 83 | **Information Need** |
|---|
| 84 | The information need is what the user is looking for. They will usually |
|---|
| 85 | attempt to express this as a query string. |
|---|
| 86 | |
|---|
| 87 | **Information Retrieval (IR)** |
|---|
| 88 | Information Retrieval is the "science of search". It's the name used to |
|---|
| 89 | refer to the study of search and related topics in academia. |
|---|
| 90 | |
|---|
| 91 | **MSet (Match Set)** |
|---|
| 92 | The Match Set (MSet) is a ranked list of documents resulting from a query. |
|---|
| 93 | The list is ranked according to document weighting, so the top document has |
|---|
| 94 | the highest probability of relevance, the second document the second highest, |
|---|
| 95 | and so on. The number of documents in the MSet can be controlled, so it does |
|---|
| 96 | not usually contain all of the matching documents. |
|---|
| 97 | |
|---|
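To make the idea of a size-limited ranked list concrete, here is a minimal sketch which sorts documents by weight and returns one page of results. The weights are invented, and the function ``match_set`` with its ``first`` and ``maxitems`` parameters is a hypothetical helper for this example rather than Xapian's API.

```python
# Hypothetical document weights: document ID -> weight.
weights = {7: 1.2, 3: 4.5, 12: 0.3, 9: 2.8, 5: 3.1}

def match_set(weights, first, maxitems):
    """Return up to maxitems (docid, weight) pairs, starting at rank first."""
    # Highest weight first; ties are broken by ascending document ID.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[first:first + maxitems]

print(match_set(weights, 0, 3))  # [(3, 4.5), (5, 3.1), (9, 2.8)]
print(match_set(weights, 3, 3))  # [(7, 1.2), (12, 0.3)]
```

The second call shows why such a list need not contain every matching document: each request asks only for a window of the ranking.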
| 98 | **Normalised document length (ndl)** |
|---|
| 99 | The normalised document length (ndl) is the length of a document (the number |
|---|
| 100 | of terms it contains) divided by the average length of the documents |
|---|
| 101 | within the system. So an average length document would have ndl equal to 1, |
|---|
| 102 | while shorter documents have ndl less than 1, and longer documents greater |
|---|
| 103 | than 1. |
|---|
| 104 | |
|---|
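A small worked example of the calculation, using made-up document lengths:

```python
# Hypothetical document lengths: document ID -> number of terms.
doc_lengths = {1: 50, 2: 100, 3: 150}

# Average length of the documents in the system.
average = sum(doc_lengths.values()) / len(doc_lengths)

# Normalised document length: length divided by the average length.
ndl = {docid: length / average for docid, length in doc_lengths.items()}

print(average)  # 100.0
print(ndl)      # {1: 0.5, 2: 1.0, 3: 1.5}
```

Document 2 is of average length and so has ndl equal to 1, while the shorter document 1 and the longer document 3 fall below and above 1 respectively.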
| 105 | **Omega** |
|---|
| 106 | Omega comprises two indexers and a CGI search application built using the |
|---|
| 107 | Xapian library. |
|---|
| 108 | |
|---|
| 109 | **Posting List** |
|---|
| 110 | A posting list is a list of the documents which a specific term indexes. This |
|---|
| 111 | can be thought of as a list of numbers - the document IDs. |
|---|
| 112 | |
|---|
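The relationship between the terms of each document and the posting lists can be sketched by inverting a small mapping. The documents and terms below are invented for the example, and the in-memory dictionaries are only a stand-in for how a real database stores this information.

```python
from collections import defaultdict

# Hypothetical documents: document ID -> the terms which index it.
term_lists = {
    1: ["cat", "sat", "mat"],
    2: ["cat", "dog"],
    3: ["dog", "sat"],
}

# Invert the mapping: term -> list of document IDs, in ascending order.
posting_lists = defaultdict(list)
for docid in sorted(term_lists):
    for term in sorted(set(term_lists[docid])):
        posting_lists[term].append(docid)

print(posting_lists["cat"])  # [1, 2]
print(posting_lists["sat"])  # [1, 3]
# The number of documents indexed by a term is the length of its posting list.
print(len(posting_lists["dog"]))  # 2
```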
| 113 | **Posting** |
|---|
| 114 | An instance of a particular term indexing a particular document. |
|---|
| 115 | |
|---|
| 116 | **Precision** |
|---|
| 117 | Precision is the density of relevant documents amongst those retrieved: the |
|---|
| 118 | number of relevant documents returned divided by the total number of documents |
|---|
| 119 | returned. |
|---|
| 120 | |
|---|
| 121 | **Probabilistic IR** |
|---|
| 122 | Probabilistic IR is retrieval based on probability theory, which can produce a |
|---|
| 123 | ranked list of documents based upon relevance. Xapian uses probabilistic |
|---|
| 124 | methods (the only exception is when a pure Boolean query is chosen). |
|---|
| 125 | |
|---|
| 126 | **Quartz** |
|---|
| 127 | Quartz was the database format used by Xapian prior to version 1.0. It is |
|---|
| 128 | now deprecated, and support will be dropped in some future Xapian release. |
|---|
| 129 | New installations should use Flint, and existing installations should consider |
|---|
| 130 | migrating to Flint. |
|---|
| 131 | |
|---|
| 132 | **Query** |
|---|
| 133 | A query is the information need expressed in a form that an IR system can |
|---|
| 134 | read. It is usually a text string containing terms, and may include Boolean |
|---|
| 135 | operators such as AND or OR, etc. |
|---|
| 136 | |
|---|
| 137 | **Query Expansion** |
|---|
| 138 | Modifying a query in an attempt to broaden the search results. |
|---|
| 139 | |
|---|
| 140 | .. _rset: |
|---|
| 141 | |
|---|
| 142 | **RSet (Relevance Set)** |
|---|
| 143 | The Relevance Set (RSet) is the set of documents which have been marked by the |
|---|
| 144 | user as relevant. They can be used to suggest terms that the user may want to |
|---|
| 145 | add to the query (these terms form an ESet), and also to adjust term weights |
|---|
| 146 | to reorder query results. |
|---|
| 147 | |
|---|
| 148 | **Recall** |
|---|
| 149 | Recall is the proportion of relevant documents retrieved - the number of |
|---|
| 150 | relevant documents retrieved divided by the total number of relevant |
|---|
| 151 | documents. |
|---|
| 152 | |
|---|
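Precision and recall are both simple ratios over the same two sets, as this sketch shows. The function name and the sample document IDs are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents which were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Here two of the four retrieved documents are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).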
| 153 | **Relevance** |
|---|
| 154 | Essentially, a document is relevant if it is what the user wanted. Ideally, |
|---|
| 155 | the retrieved documents will all be relevant, and the non-retrieved ones all |
|---|
| 156 | non-relevant. |
|---|
| 157 | |
|---|
| 158 | **Searcher** |
|---|
| 159 | The searcher is the part of the IR system which takes queries and reads the |
|---|
| 160 | database to return a list of relevant documents. |
|---|
| 161 | |
|---|
| 162 | **Stemming** |
|---|
| 163 | A stemming algorithm performs linguistic normalisation by reducing variant |
|---|
| 164 | forms of a word to a common form. In English, this mainly involves removing |
|---|
| 165 | suffixes - such as converting any of the words "talking", "talks", or "talked" |
|---|
| 166 | to the stem form "talk". |
|---|
| 167 | |
|---|
| 168 | **Stop word** |
|---|
| 169 | A word which is ignored during indexing and/or searching, usually because it |
|---|
| 170 | is very common or doesn't convey meaning. For example, "the", "a", "to". |
|---|
| 171 | |
|---|
| 172 | **Synonyms** |
|---|
| 173 | Xapian can store synonyms for terms, and use these to implement one approach |
|---|
| 174 | to query expansion. |
|---|
| 175 | |
|---|
| 176 | **Term List** |
|---|
| 177 | A term list is the list of terms that index a specific document. In some |
|---|
| 178 | systems this may be a list of numbers (with each term represented by a number |
|---|
| 179 | internally); in Xapian it is a list of strings (the terms). |
|---|
| 180 | |
|---|
| 181 | **Term frequency** |
|---|
| 182 | The term frequency of a specific term is the number of documents in the system |
|---|
| 183 | that are indexed by that term. |
|---|
| 184 | |
|---|
| 185 | **Term** |
|---|
| 186 | A term is a string of bytes (often a word or word stem) which describes a |
|---|
| 187 | document. Terms are similar to the index entries found in the back of a book |
|---|
| 188 | and each document may be described by many terms. A query is composed from |
|---|
| 189 | a list of terms (perhaps linked by Boolean operators). |
|---|
| 190 | |
|---|
| 191 | **Term Prefix** |
|---|
| 192 | By convention, terms in Xapian can be prefixed to indicate a field in the |
|---|
| 193 | document which they come from, or some other form of type information. |
|---|
| 194 | The term prefix is usually a single capital letter. |
|---|
| 195 | |
|---|
| 196 | **Test Collection** |
|---|
| 197 | A test collection consists of a set of documents and a set of queries each of |
|---|
| 198 | which has a complete set of relevance assignments - this is used to test how |
|---|
| 199 | well different IR methods perform. |
|---|
| 200 | |
|---|
| 201 | **UTF-8** |
|---|
| 202 | A standard variable-length byte-oriented encoding for Unicode. |
|---|
| 203 | |
|---|
| 204 | **Value** |
|---|
| 205 | A discrete meta-data attribute attached to a document. Each document can |
|---|
| 206 | have many values, each stored in a different numbered slot. Values are |
|---|
| 207 | designed to be fast to access during the matching process, and can be used for |
|---|
| 208 | sorting, collapsing redundant documents, implementing ranges, and other uses. |
|---|
| 209 | If you just want to store "fields" for displaying results, it's better |
|---|
| 210 | to store them in the document data. |
|---|
| 211 | |
|---|
| 212 | **Within-document frequency (wdf)** |
|---|
| 213 | The within-document frequency (wdf) of a term in a specific document is the |
|---|
| 214 | number of times it is pulled out of the document in the indexing process. |
|---|
| 215 | Usually this is the size of the wdp vector, but in Xapian it can exceed it, |
|---|
| 216 | since we can apply extra wdf to some parts of the document text. |
|---|
| 217 | |
|---|
| 218 | **Within-document positions (wdp)** |
|---|
| 219 | In the case where a term derives from words actually in the document, the |
|---|
| 220 | within-document positions (wdp) are the positions at which that word occurs |
|---|
| 221 | within the document. So if the term derives from a word that occurs three |
|---|
| 222 | times in the document as the fifth, 22nd and 131st word, the wdps will be 5, |
|---|
| 223 | 22 and 131. |
|---|
| 224 | |
|---|
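As an illustration of positional information, the sketch below records the 1-based position of every word in a short text. The text and the helper name ``positions`` are made up for this example, and splitting on whitespace is a simplification of what a real indexer does.

```python
from collections import defaultdict

def positions(words):
    """Map each word to the 1-based positions at which it occurs."""
    wdp = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        wdp[word].append(pos)
    return dict(wdp)

words = "the cat sat on the mat near the door".split()
wdp = positions(words)

print(wdp["the"])       # [1, 5, 8]
# With no extra weighting applied, the wdf equals the number of positions.
print(len(wdp["the"]))  # 3
```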
| 225 | **Within-query frequency (wqf)** |
|---|
| 226 | The within-query frequency (wqf) is the number of times a term occurs in the |
|---|
| 227 | query. This statistic is used in the BM25 weighting scheme. |
|---|
| 228 | |
|---|
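Counting within-query frequencies is a one-liner once the query has been split into terms. The query terms below are invented for the example.

```python
from collections import Counter

# Hypothetical query, already split into terms.
query_terms = ["cheap", "flights", "cheap", "hotels"]

# Within-query frequency: how often each term occurs in the query.
wqf = Counter(query_terms)

print(wqf["cheap"])    # 2
print(wqf["flights"])  # 1
```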
| 229 | .. wqp? nql? Is it worth adding these - they're not referenced much. |
|---|