Ticket #290: overview.rst

File overview.rst, 18.8 KB (added by Frank J Bruzzaniti, 15 years ago)

I've updated the doco, not sure if it was meant to be a diff. If so, let me know.

==============
Omega overview
==============

If you just want a very quick overview, you might prefer to read the
`quick-start guide <quickstart.html>`_.

Omega operates on a set of databases. Each database is created and updated
separately using either omindex or `scriptindex <scriptindex.html>`_. You can
search these databases (or any other Xapian database with suitable contents)
via a web front-end provided by omega, a CGI application. A search can also be
done over more than one database at once.

There are separate documents covering `CGI parameters <cgiparams.html>`_, the
`Term Prefixes <termprefixes.html>`_ which are conventionally used, and
`OmegaScript <omegascript.html>`_, the language used to define omega's web
interface. Omega ships with several OmegaScript templates and you can
use these, modify them, or just write your own. See the "Supplied Templates"
section below for details of the supplied templates.

Omega parses queries using the ``Xapian::QueryParser`` class. For the supported
syntax, see queryparser.html in the xapian-core documentation, which is
available online at http://www.xapian.org/docs/queryparser.html

Term construction
=================

Documents within an omega database are stored with two types of terms:
those used for probabilistic searching (the CGI parameter 'P'), and
those used for boolean filtering (the CGI parameter 'B'). Boolean
terms start with an initial capital letter denoting the 'group' of the
term (e.g. 'T' for MIME type), while probabilistic terms are all
lower-case, and are also stemmed before adding to the
database.

The "english" stemmer is used by default - you can configure this for omindex
and scriptindex with "--stemmer LANGUAGE" (use 'none' to disable stemming, see
omindex --help for the list of accepted language names). At search time you
can configure the stemmer by adding $set{stemmer,LANGUAGE} to the top of your
OmegaScript template.

The two term types are used as follows when building the query:
B(oolean) terms with the same prefix are ORed together, with all the
different prefix groups being ANDed together. This is then FILTERed
against the P(robabilistic) terms. This will look something like::

                 [ FILTER ]
                  /      \
                 /        \
            P-terms      [     AND     ]
                          /     |  ... \
                         /
                   [    OR    ]
                   / | ... \
              B(F,1) B(F,2)...B(F,n)

Where B(F,1) is the first boolean term with prefix F, and so on.

The intent here is to allow filtering on arbitrary (and, typically,
orthogonal) characteristics of the document. For instance, by adding
boolean terms "Ttext/html", "Ttext/plain" and "P/press" you would be
filtering the probabilistic search for only documents that are both in
the "/press" site *and* which are either of MIME type text/html or
text/plain. (See below for more information about sites.)

If there is no probabilistic query, the boolean filter is promoted to
be the query, and the weighting scheme is set to boolean. This has
the effect of applying the boolean filter to the whole database.
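
The query structure described above can be modelled in a few lines of Python.
This is an illustrative sketch only, not Omega's implementation: the
``build_query`` function and its tuple representation are hypothetical, and
the prefix is assumed to be the single leading capital letter of each boolean
term.

```python
from collections import defaultdict


def build_query(p_terms, b_terms):
    """Return a nested tuple describing how the terms are combined."""
    groups = defaultdict(list)
    for term in b_terms:
        # Assumption: the prefix is the single leading capital letter.
        groups[term[0]].append(term)

    # OR together terms sharing a prefix, then AND the prefix groups.
    or_groups = [("OR", *sorted(terms)) for _, terms in sorted(groups.items())]
    bool_filter = ("AND", *or_groups) if or_groups else None
    prob = ("PROB", *p_terms) if p_terms else None

    if prob and bool_filter:
        return ("FILTER", prob, bool_filter)
    # With no probabilistic terms the boolean filter becomes the query.
    return prob or bool_filter


query = build_query(["press", "release"],
                    ["Ttext/html", "Ttext/plain", "P/press"])
print(query)
```

With the example terms from the text, the two ``T`` terms end up in one OR
group and the ``P`` term in another, and both groups are ANDed before being
applied as a filter.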

In order to add more boolean prefixes, you will need to alter the
``index_file()`` function in omindex.cc. Currently omindex adds several
useful ones, detailed below.

Probabilistic terms are constructed from the title, body and keywords
of a document. (Not all document types support all three areas of
text.) Title terms are stored with position data starting at 0, body
terms starting 100 beyond title terms, and keyword terms starting 100
beyond body terms. This allows queries using positional data without
causing false matches across the different types of term.
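
The effect of the position gaps can be illustrated with a small Python
sketch. The ``assign_positions`` helper is hypothetical, and the exact
offsets omindex uses may differ by one or so from those produced here; the
point is only that adjacent positions never straddle two areas of text.

```python
def assign_positions(title, body, keywords, gap=100):
    """Return (field, word, position) triples with a gap between fields."""
    triples = []
    pos = 0
    for field, words in (("title", title), ("body", body),
                         ("keywords", keywords)):
        for word in words:
            triples.append((field, word, pos))
            pos += 1
        # Leave a gap so positional matches cannot span two fields.
        pos += gap
    return triples


for entry in assign_positions(["big", "release"], ["our", "product"], ["news"]):
    print(entry)
```

Because the last title word and the first body word are at least 100
positions apart, a phrase search cannot match text that starts in the title
and continues in the body.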

Sites
=====

Within a database, Omega supports multiple sites. These are recorded
using boolean terms (see 'Term construction', above) to allow
filtering on them.

All documents within a site share a common base URL. For instance, you
might have two sites, one for your press area and one for your product
descriptions:

    - \http://example.com/press/index.html
    - \http://example.com/press/bigrelease.html
    - \http://example.com/products/bigproduct.html
    - \http://example.com/products/littleproduct.html

You could index all documents within \http://example.com/press/ using a
site of '/press', and all within \http://example.com/products/ using
'/products'.

Sites are also useful because omindex indexes documents through the
file system, not by fetching from the web server. If you don't have a
URL to file system mapping which puts all documents under one
hierarchy, you'll need to index each separate section as a site.

An obvious example of this is the way that many web servers map URLs
of the form <\http://example.com/~<username>/> to a directory within
that user's home directory (such as ~<username>/pub on a Unix
system). In this case, you can index each user's home page separately,
as a site of the form '/~<username>'. You can then use boolean
filters to allow people to search only a specific home page (or a
group of them), or omit such terms to search everyone's pages.

Note that the site specified when you index is used to build the
complete URL that the results page links to. Thus while sites will
typically want to be relative to the hostname part of the URL (e.g.
'/site' rather than '\http://example.com/site'), you can use them
to have a single search across several different hostnames. This will
still work if you actually store each distinct hostname in a different
database.

omindex operation
=================

omindex is fairly simple to use, for example::

    omindex --db default --url http://example.com/ /var/www/example.com

For a full list of command line options supported, see ``man omindex``
or ``omindex --help``.

You *must* specify the database to index into (it's created if it doesn't
exist, but parent directories must exist). You will often also want to specify
the base URL (which is used as the site, and can be relative to the hostname -
starts with '/' - or absolute - starts with a scheme, e.g.
'\http://example.com/products/'). If not specified, the base URL defaults to
``/``.

You also need to tell omindex which directory to index. This should be
either a single directory (in which case it is taken to be the
directory base of the entire site being indexed), or two arguments,
the first being the directory base of the site being indexed, and the
second being a relative directory within that to index.

For instance, in the example above, if you separate your products by
size, you might end up with:

    - \http://example.com/press/index.html
    - \http://example.com/press/bigrelease.html
    - \http://example.com/products/large/bigproduct.html
    - \http://example.com/products/small/littleproduct.html

If the entire website is stored in the file system under the directory
/www/example, then you would probably index the site in two
passes, one for the '/press' site and one for the '/products' site. You
might use the following commands::

    $ omindex -p --db /var/lib/omega/data/default --url /press /www/example/press
    $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products

If you add a new large product, but don't want to reindex the whole of
the products section, you could do::

    $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products large

and just the large products will be reindexed. You need to do it like that, and
not as::

    $ omindex -p --db /var/lib/omega/data/default --url /products/large /www/example/products/large

because that would make the large products part of a new site,
'/products/large', which is unlikely to be what you want, as large
products would no longer come up in a search of the products
site. (Note that the --depth-limit option may come in handy if you have
sites '/products' and '/products/large', or similar.)
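
The difference between the two invocations can be modelled with a short
Python sketch. The ``index_entry`` helper is hypothetical and only
approximates how a URL is derived from the base URL and base directory; it is
not taken from omindex itself.

```python
import posixpath


def index_entry(base_url, base_dir, file_path):
    """Return the site and URL a file would be recorded under (a model)."""
    # Path of the file relative to the directory base of the site.
    relative = posixpath.relpath(file_path, base_dir)
    return {"site": base_url,
            "url": base_url.rstrip("/") + "/" + relative}


path = "/www/example/products/large/bigproduct.html"

# Base directory /www/example/products with relative directory "large".
print(index_entry("/products", "/www/example/products", path))

# Treating the large products directory as its own base.
print(index_entry("/products/large", "/www/example/products/large", path))
```

Both calls yield the same URL, but only the first records the document under
the '/products' site, so only the first keeps it visible to a search
filtered on that site.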

omindex has built-in support for indexing HTML, PHP, text files, and AbiWord
documents. It can also index a number of other formats using external
programs. Filter programs are run with CPU and memory limits to prevent a
runaway filter from blocking indexing of other files.

The following formats are currently supported (if you know of a reliable
filter which can extract text from another useful file format, please let us
know):

* HTML (.html, .htm, .shtml)
* PHP (.php) - our HTML parser knows to ignore PHP code
* text files (.txt, .text)
* PDF (.pdf) if pdftotext is available (comes with xpdf)
* PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes
  with xpdf) are available
* OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm,
  .sxw, .sxg, .stw) if unzip is available
* OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb,
  .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is
  available
* MS Word documents (.docx) and (.doc, .dot) if antiword is available
* MS Excel documents (.xlsx) and (.xls, .xlb, .xlt) if xls2csv is available
  (comes with catdoc)
* MS PowerPoint documents (.pptx) and (.ppt, .pps) if catppt is available
  (comes with catdoc)
* WordPerfect documents (.wpd) if wpd2text is available (comes with libwpd)
* MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
* AbiWord documents (.abw)
* Compressed AbiWord documents (.zabw) if gzip is available
* Rich Text Format documents (.rtf) if unrtf is available
* Perl POD documentation (.pl, .pm, .pod) if pod2text is available
* TeX DVI files (.dvi) if catdvi is available
* DjVu files (.djv, .djvu) if djvutxt is available

If you have additional extensions that represent one of these types, you need
to add an additional MIME mapping using the --mime-type option. For instance::

    $ omindex --db /var/lib/omega/data/default --url /press /www/example/press --mime-type doc:application/postscript

The syntax of --mime-type is 'ext:type', where ext is the extension of
a file of that type (everything after the last '.'), and type is one
of:

 - text/html
 - text/plain
 - text/rtf
 - text/x-perl
 - application/msword
 - application/pdf
 - application/postscript
 - application/vnd.ms-excel
 - application/vnd.ms-powerpoint
 - application/vnd.ms-works
 - application/vnd.oasis.opendocument.text
 - application/vnd.oasis.opendocument.spreadsheet
 - application/vnd.oasis.opendocument.presentation
 - application/vnd.oasis.opendocument.graphics
 - application/vnd.oasis.opendocument.chart
 - application/vnd.oasis.opendocument.formula
 - application/vnd.oasis.opendocument.database
 - application/vnd.oasis.opendocument.image
 - application/vnd.oasis.opendocument.text-master
 - application/vnd.oasis.opendocument.text-template
 - application/vnd.oasis.opendocument.spreadsheet-template
 - application/vnd.oasis.opendocument.presentation-template
 - application/vnd.oasis.opendocument.graphics-template
 - application/vnd.oasis.opendocument.chart-template
 - application/vnd.oasis.opendocument.formula-template
 - application/vnd.oasis.opendocument.image-template
 - application/vnd.oasis.opendocument.text-web
 - application/vnd.sun.xml.calc
 - application/vnd.sun.xml.calc.template
 - application/vnd.sun.xml.draw
 - application/vnd.sun.xml.draw.template
 - application/vnd.sun.xml.impress
 - application/vnd.sun.xml.impress.template
 - application/vnd.sun.xml.math
 - application/vnd.sun.xml.writer
 - application/vnd.sun.xml.writer.global
 - application/vnd.sun.xml.writer.template
 - application/vnd.wordperfect
 - application/x-abiword
 - application/x-abiword-compressed
 - application/x-dvi
 - image/vnd.djvu

If you wish to remove a MIME mapping, you can do this by omitting the type -
for example to not index .doc files, use: --mime-type doc:

The lookup of extensions in the MIME mappings is case sensitive, but if an
extension isn't found and includes upper case ASCII letters, they're converted
to lower case and the lookup is repeated, so you effectively get case
insensitive lookup for mappings specified with a lower-case extension, but
you can set different handling for differently cased variants if you need
to.
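
The two-step lookup described above can be sketched in Python as follows.
The ``lookup_mime_type`` function and the sample mapping are hypothetical and
exist only to illustrate the rule; the MIME type given for the upper-case
``PDF`` entry is made up for the example.

```python
import string

# Translation table that lower-cases ASCII letters only.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def lookup_mime_type(mime_map, ext):
    """Look up ext exactly, then retry with ASCII letters lower-cased."""
    if ext in mime_map:
        return mime_map[ext]
    if any(c in string.ascii_uppercase for c in ext):
        return mime_map.get(ext.translate(_ASCII_LOWER))
    return None


mime_map = {
    "html": "text/html",
    "PDF": "application/x-example",  # made-up type for an upper-case variant
}

print(lookup_mime_type(mime_map, "HTML"))
print(lookup_mime_type(mime_map, "PDF"))
print(lookup_mime_type(mime_map, "pdf"))
```

Here ``HTML`` falls back to the lower-case ``html`` mapping, ``PDF`` matches
its own entry exactly, and ``pdf`` finds nothing because a lower-case
extension is never retried in another case.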

--duplicates configures how omindex handles duplicates (detected on
URL). 'ignore' means to ignore a document if it already appears to be
in the database; 'replace' means to replace the document in the
database with a new one by indexing this file, and 'duplicate' means
to index this file as a new document, leaving the previous one in the
database as well. The last strategy is very fast, but is liable to do
strange things to your result set. In general, 'ignore' is useful for
completely static documents (e.g. archive sites), while 'replace' is
the most generally useful.

With 'replace', omindex will remove any document it finds in the
database that it did not update - in other words, it will clear out
everything that doesn't exist any more. However if you are building up
an omega database with several runs of omindex, this is not
appropriate (as each run would delete the data from the previous run),
so you should use the --preserve-nonduplicates option. Note that if you
choose to work like this, it is impossible to prune old documents from
the database using omindex. If this is a problem for you, an
alternative is to index each subsite into a different database, and
merge all the databases together when searching.

--depth-limit allows you to prevent omindex from descending more than
a certain number of directories. If you wish to replicate the old
--no-recurse option, use --depth-limit=1.

HTML Parsing
============

The document ``<title>`` tag is used as the document title, the 'description'
META tag (if present) is used for the document snippet, and the 'keywords'
META tag (if present) is indexed as extra document text.

The HTML parser will look for the 'robots' META tag, and won't index pages
which are marked as ``noindex`` or ``none``, for example any of the following::

    <meta name="robots" content="noindex,nofollow">
    <meta name="robots" content="noindex">
    <meta name="robots" content="none">
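
A check of this kind can be approximated with Python's standard
``html.parser`` module. This is a sketch of the rule as described, not
omindex's own parser; the ``RobotsParser`` class and ``should_index``
function are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsParser(HTMLParser):
    """Record whether a 'robots' META tag forbids indexing."""

    def __init__(self):
        super().__init__()
        self.indexable = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        values = {value.strip().lower() for value in content.split(",")}
        if values & {"noindex", "none"}:
            self.indexable = False


def should_index(html):
    parser = RobotsParser()
    parser.feed(html)
    return parser.indexable


print(should_index('<meta name="robots" content="noindex,nofollow">'))
print(should_index('<meta name="robots" content="index,follow">'))
```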

The parser also understands ht://dig comments to mark sections of the document
to not index (for example, you can use this to avoid indexing navigation links
or standard headers/footers) - for example::

    Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->
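
The effect of these markers can be imitated with a regular expression in
Python. This is only a sketch under the assumption that the markers are
properly paired; ``strip_noindex`` is a hypothetical helper and says nothing
about how omindex treats unclosed sections.

```python
import re

# Match from an opening marker to the nearest closing marker.
_NOINDEX = re.compile(r"<!--htdig_noindex-->.*?<!--/htdig_noindex-->",
                      re.DOTALL)


def strip_noindex(html):
    """Drop every section wrapped in ht://dig noindex comments."""
    return _NOINDEX.sub("", html)


sample = ("Index this bit <!--htdig_noindex-->but <b>not</b> this"
          "<!--/htdig_noindex-->")
print(repr(strip_noindex(sample)))
```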

Boolean terms
=============

omindex will create the following boolean terms when it indexes a
document:

T
        MIME type
H
        hostname of site (if supplied - this term won't exist if you index a
        site with base URL '/press', for instance)
P
        path of site (i.e. the rest of the site base URL)
U
        full URL of indexed document - if the resulting term would be > 240
        characters, a hashing scheme is used to prevent omindex overflowing
        the Xapian term length limit.
D
        date (numeric format: YYYYMMDD)

        The date can also have the magical form "latest" - a document indexed
        by the term Dlatest matches any date-range without an end date.
        You can index dynamic documents which are always up to date
        with Dlatest and they'll match as expected. (If you use sort by date,
        you'll probably also want to set the value containing the timestamp to
        a "max" value so dynamic documents match a date in the far future).
M
        month (numeric format: YYYYMM)
Y
        year (four digits)
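
As an illustration, the following Python sketch builds most of these terms
for one document. The ``boolean_terms`` function is hypothetical, the split
of the base URL into hostname and path is an assumption about what "the rest
of the site base URL" means, and the U term is left out because its hashing
scheme is not described here.

```python
from datetime import date
from urllib.parse import urlsplit


def boolean_terms(mime_type, base_url, modified):
    """Return T, H, P, D, M and Y terms for a document (U is omitted)."""
    terms = ["T" + mime_type]
    parts = urlsplit(base_url)
    if parts.hostname:
        # Only present when the base URL includes a hostname.
        terms.append("H" + parts.hostname)
    terms.append("P" + parts.path)
    terms.append(f"D{modified.year:04d}{modified.month:02d}{modified.day:02d}")
    terms.append(f"M{modified.year:04d}{modified.month:02d}")
    terms.append(f"Y{modified.year:04d}")
    return terms


print(boolean_terms("text/html", "/press", date(2009, 3, 7)))
print(boolean_terms("text/html", "http://example.com/press", date(2009, 3, 7)))
```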

omega configuration
===================

Most of the omega CGI configuration is dynamic, by setting CGI
parameters. However some things must be configured using a
configuration file. The configuration file is searched for in
various locations:

 - Firstly, if the "OMEGA_CONFIG_FILE" environment variable is
   set, its value is used as the full path to a configuration file
   to read.
 - Next (if the environment variable is not set, or the file pointed
   to is not present), the file "omega.conf" in the same directory as
   the Omega CGI is used.
 - Next (if neither of the previous steps found a file), the file
   "${sysconfdir}/omega.conf" (e.g. /etc/omega.conf on Linux systems)
   is used.
 - Finally, if no configuration file is found, default values are used.

The format of the file is very simple: a line per option, with the
option name followed by its value, separated by whitespace. Blank
lines are ignored. If the first non-whitespace character on a line
is a '#', omega treats the line as a comment and ignores it.

The current options are 'database_dir' (the directory containing all the
Omega databases), 'template_dir' (the directory containing the OmegaScript
templates), and 'log_dir' (the directory which the OmegaScript $log command
writes log files to).

The default values (used if no configuration file is found) are::

    database_dir /var/lib/omega/data
    template_dir /var/lib/omega/templates
    log_dir /var/log/omega
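
The file format is simple enough to model in a few lines of Python. This
``parse_config`` function is a hypothetical sketch of the rules above, not
omega's parser; in particular, falling back to the default for any option
missing from the file is an assumption made for the example.

```python
DEFAULTS = {
    "database_dir": "/var/lib/omega/data",
    "template_dir": "/var/lib/omega/templates",
    "log_dir": "/var/log/omega",
}


def parse_config(text):
    """Parse 'name value' lines, skipping blank lines and '#' comments."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        if len(parts) == 2:
            config[parts[0]] = parts[1]
    return config


sample = """
# Keep databases on the big disk (example path)
database_dir /srv/omega/data
"""
print(parse_config(sample))
```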

Note that, with Apache, environment variables may be set using mod_env, and
with Apache 1.3.7 or later this may be used inside a .htaccess file. This
makes it reasonably easy to share a single system installed copy of Omega
between multiple users.

Supplied Templates
==================

The OmegaScript templates supplied with Omega are:

 * query - This is the default template, providing a typical Web search
   interface.
 * topterms - This is just like query, but provides a "top terms" feature
   which suggests terms the user might want to add to their query to
   obtain better results.
 * godmode - Allows you to inspect a database showing which terms index
   each document, and which documents are indexed by each term.
 * opensearch - Provides results in OpenSearch format (for more details
   see http://www.opensearch.org/).
 * xml - Provides results in a custom XML format.

There are also "helper fragments" used by the templates above:

 * inc/anyalldropbox - Provides a choice of matching "any" or "all" terms
   by default as a drop down box.
 * inc/anyallradio - Provides a choice of matching "any" or "all" terms
   by default as radio buttons.
 * toptermsjs - Provides some JavaScript used by the topterms template.

Document data construction
==========================

This is only useful if you need to inject your own documents into the
database independently of omindex. For example, you might be indexing
dynamically-generated documents that are served using a server-side
system such as PHP or ASP, but whose contents you can determine in
some way, such as documents generated from reasonably static
database contents.

The document data field stores some summary information about the
document, in the following (sample) format::

    url=<baseurl>
    sample=<sample>
    caption=<title>
    type=<mimetype>

Further fields may be added (although omindex doesn't currently add any
others), and may be looked up from OmegaScript using the $field{}
command.

As of Omega 0.9.3, you can alternatively add something like this near the
start of your OmegaScript template::

    $set{fieldnames,$split{caption sample url}}

Then you need only give the field values in the document data, which can
save a lot of space in a large database. With the setting of fieldnames
above, the first line of document data can be accessed with $field{caption},
the second with $field{sample}, and the third with $field{url}.
437
438$set{fieldnames,$split{caption sample url}}
439
440Then you need only give the field values in the document data, which can
441save a lot of space in a large database. With the setting of fieldnames
442above, the first line of document data can be accessed with $field{caption},
443the second with $field{sample}, and the third with $field{url}.