==============
Omega overview
==============

If you just want a very quick overview, you might prefer to read the
`quick-start guide <quickstart.html>`_.

Omega operates on a set of databases. Each database is created and updated
separately using either omindex or `scriptindex <scriptindex.html>`_. You can
search these databases (or any other Xapian database with suitable contents)
via a web front-end provided by omega, a CGI application. A search can also be
done over more than one database at once.

There are separate documents covering `CGI parameters <cgiparams.html>`_, the
`Term Prefixes <termprefixes.html>`_ which are conventionally used, and
`OmegaScript <omegascript.html>`_, the language used to define omega's web
interface. Omega ships with several OmegaScript templates and you can
use these, modify them, or just write your own. See the "Supplied Templates"
section below for details of the supplied templates.

Omega parses queries using the ``Xapian::QueryParser`` class. For the supported
syntax, see queryparser.html in the xapian-core documentation (available online
at http://www.xapian.org/docs/queryparser.html).

Term construction
=================

Documents within an omega database are stored with two types of terms:
those used for probabilistic searching (the CGI parameter 'P'), and
those used for boolean filtering (the CGI parameter 'B'). Boolean
terms start with an initial capital letter denoting the 'group' of the
term (e.g. 'T' for MIME type), while probabilistic terms are all
lower-case, and are also stemmed before being added to the
database.

The "english" stemmer is used by default - you can configure this for omindex
and scriptindex with "--stemmer LANGUAGE" (use 'none' to disable stemming, see
omindex --help for the list of accepted language names). At search time you
can configure the stemmer by adding $set{stemmer,LANGUAGE} to the top of your
OmegaScript template.

The two term types are used as follows when building the query:
B(oolean) terms with the same prefix are ORed together, with all the
different prefix groups being ANDed together. This is then FILTERed
against the P(robabilistic) terms. This will look something like::

                 [ FILTER ]
                  /      \
                 /        \
            P-terms      [ AND ]
                        /  | ...  \
                       /
                  [ OR ]
                 /  | ...  \
           B(F,1) B(F,2)...B(F,n)

Where B(F,1) is the first boolean term with prefix F, and so on.

The intent here is to allow filtering on arbitrary (and, typically,
orthogonal) characteristics of the document. For instance, by adding
boolean terms "Ttext/html", "Ttext/plain" and "P/press" you would be
filtering the probabilistic search for only documents that are both in
the "/press" site *and* which are either of MIME type text/html or
text/plain. (See below for more information about sites.)

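The grouping described above can be modelled in a few lines of Python. This is
only an illustrative sketch using nested tuples, not Omega's actual (C++)
implementation: the function name ``build_filter_tree`` is hypothetical, and it
assumes that the prefix of a boolean term is its single leading capital letter.

```python
from collections import OrderedDict


def build_filter_tree(p_terms, b_terms):
    """Combine probabilistic terms with boolean filter terms (toy model).

    Assumption for this sketch: the prefix of a boolean term is its single
    leading capital letter (e.g. 'T' in 'Ttext/html').
    """
    groups = OrderedDict()
    for term in b_terms:
        groups.setdefault(term[0], []).append(term)

    # Terms sharing a prefix are ORed; the per-prefix groups are ANDed.
    boolean_filter = ("AND", [("OR", terms) for terms in groups.values()])

    # The probabilistic terms are then FILTERed by the boolean expression.
    return ("FILTER", list(p_terms), boolean_filter)


tree = build_filter_tree(
    ["press", "release"],
    ["Ttext/html", "Ttext/plain", "P/press"],
)
print(tree)
```

With the example terms from the text, the two 'T' terms end up in one OR group
and the 'P' term in another, and both groups sit under a single AND.
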
If there is no probabilistic query, the boolean filter is promoted to
be the query, and the weighting scheme is set to boolean. This has
the effect of applying the boolean filter to the whole database.

In order to add more boolean prefixes, you will need to alter the
``index_file()`` function in omindex.cc. Currently omindex adds several
useful ones, detailed below.

Probabilistic terms are constructed from the title, body and keywords
of a document. (Not all document types support all three areas of
text.) Title terms are stored with position data starting at 0, body
terms starting 100 beyond title terms, and keyword terms starting 100
beyond body terms. This allows queries using positional data without
causing false matches across the different types of term.

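The effect of the position gaps can be illustrated with a small Python model.
This is a sketch under assumptions, not omindex's code: the function name
``assign_positions`` is hypothetical, terms are split on whitespace without
stemming, and the exact starting offsets omindex uses may differ from the ones
computed here.

```python
def assign_positions(title, body, keywords, gap=100):
    """Return (field, term, position) triples with a gap between fields."""
    postings = []
    position = 0
    for field, text in (("title", title), ("body", body), ("keywords", keywords)):
        for term in text.lower().split():
            postings.append((field, term, position))
            position += 1
        # Leave a gap so that a phrase or proximity query cannot match
        # across the boundary between two different areas of text.
        position += gap
    return postings


postings = assign_positions("Big Release", "new product", "press")
for field, term, position in postings:
    print(field, term, position)
```

Because adjacent fields are at least ``gap`` positions apart, a phrase search
for "release new" would not match the last title term followed by the first
body term.
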
Sites
=====

Within a database, Omega supports multiple sites. These are recorded
using boolean terms (see 'Term construction', above) to allow
filtering on them.

All documents within a site share a common base URL. For instance, you
might have two sites, one for your press area and one for your product
descriptions:

- \http://example.com/press/index.html
- \http://example.com/press/bigrelease.html
- \http://example.com/products/bigproduct.html
- \http://example.com/products/littleproduct.html

You could index all documents within \http://example.com/press/ using a
site of '/press', and all within \http://example.com/products/ using
'/products'.

Sites are also useful because omindex indexes documents through the
file system, not by fetching from the web server. If you don't have a
URL to file system mapping which puts all documents under one
hierarchy, you'll need to index each separate section as a site.

An obvious example of this is the way that many web servers map URLs
of the form <\http://example.com/~<username>/> to a directory within
that user's home directory (such as ~<username>/pub on a Unix
system). In this case, you can index each user's home page separately,
as a site of the form '/~<username>'. You can then use boolean
filters to allow people to search only a specific home page (or a
group of them), or omit such terms to search everyone's pages.

Note that the site specified when you index is used to build the
complete URL that the results page links to. Thus while you will
typically want sites to be relative to the hostname part of the URL (e.g.
'/site' rather than '\http://example.com/site'), you can use absolute
sites to have a single search across several different hostnames. This will
still work if you actually store each distinct hostname in a different
database.

omindex operation
=================

omindex is fairly simple to use, for example::

    omindex --db default --url http://example.com/ /var/www/example.com

For a full list of command line options supported, see ``man omindex``
or ``omindex --help``.

You *must* specify the database to index into (it's created if it doesn't
exist, but parent directories must exist). You will often also want to specify
the base URL, which is used as the site. It can be relative to the hostname
(starting with '/') or absolute (starting with a scheme, e.g.
'\http://example.com/products/'). If not specified, the base URL defaults to
``/``.

You also need to tell omindex which directory to index. This should be
given either as a single directory (in which case it is taken to be the
directory base of the entire site being indexed), or as two arguments,
the first being the directory base of the site being indexed, and the
second being a relative directory within that to index.

For instance, in the example above, if you separate your products by
size, you might end up with:

- \http://example.com/press/index.html
- \http://example.com/press/bigrelease.html
- \http://example.com/products/large/bigproduct.html
- \http://example.com/products/small/littleproduct.html

If the entire website is stored in the file system under the directory
/www/example, then you would probably index the site in two
passes, one for the '/press' site and one for the '/products' site. You
might use the following commands::

    $ omindex -p --db /var/lib/omega/data/default --url /press /www/example/press
    $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products

If you add a new large product, but don't want to reindex the whole of
the products section, you could do::

    $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products large

and just the large products will be reindexed. You need to do it like that, and
not as::

    $ omindex -p --db /var/lib/omega/data/default --url /products/large /www/example/products/large

because that would make the large products part of a new site,
'/products/large', which is unlikely to be what you want, as large
products would no longer come up in a search of the products
site. (Note that the --depth-limit option may come in handy if you have
sites '/products' and '/products/large', or similar.)

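The relationship between the base URL, the base directory and the resulting
document URL can be sketched as follows. This is a simplified Python model
rather than omindex's own logic; the helper name ``url_for_file`` is
hypothetical, and URL escaping of unusual characters is not handled.

```python
import posixpath


def url_for_file(base_url, base_dir, file_path):
    """Map a file below base_dir to the URL it is reachable at below base_url."""
    relative = posixpath.relpath(file_path, base_dir)
    if relative.startswith(".."):
        raise ValueError("%s is not below %s" % (file_path, base_dir))
    return base_url.rstrip("/") + "/" + relative


# Site '/products', indexing only the 'large' subdirectory.
print(url_for_file(
    "/products",
    "/www/example/products",
    "/www/example/products/large/bigproduct.html",
))
```

Both invocations discussed above would produce the same document URL; what
differs is the site recorded for the document ('/products' versus
'/products/large'), which is why the second form is usually not what you want.
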
omindex has built-in support for indexing HTML, PHP, text files, and AbiWord
documents. It can also index a number of other formats using external
programs. Filter programs are run with CPU and memory limits to prevent a
runaway filter from blocking indexing of other files.

The following formats are currently supported (if you know of a reliable
filter which can extract text from another useful file format, please let us
know):

* HTML (.html, .htm, .shtml)
* PHP (.php) - our HTML parser knows to ignore PHP code
* text files (.txt, .text)
* PDF (.pdf) if pdftotext is available (comes with xpdf)
* PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes
  with xpdf) are available
* OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm,
  .sxw, .sxg, .stw) if unzip is available
* OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb,
  .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is
  available
* MS Word documents (.docx) and (.doc, .dot) if antiword is available
* MS Excel documents (.xlsx) and (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)
* MS Powerpoint documents (.pptx) and (.ppt, .pps) if catppt is available (comes with catdoc)
* Wordperfect documents (.wpd) if wpd2text is available (comes with libwpd)
* MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
* AbiWord documents (.abw)
* Compressed AbiWord documents (.zabw) if gzip is available
* Rich Text Format documents (.rtf) if unrtf is available
* Perl POD documentation (.pl, .pm, .pod) if pod2text is available
* TeX DVI files (.dvi) if catdvi is available
* DjVu files (.djv, .djvu) if djvutxt is available

If you have additional extensions that represent one of these types, you need
to add an additional MIME mapping using the --mime-type option. For instance::

    $ omindex --db /var/lib/omega/data/default --url /press /www/example/press --mime-type doc:application/postscript

The syntax of --mime-type is 'ext:type', where ext is the extension of
a file of that type (everything after the last '.'), and type is one
of:

- text/html
- text/plain
- text/rtf
- text/x-perl
- application/msword
- application/pdf
- application/postscript
- application/vnd.ms-excel
- application/vnd.ms-powerpoint
- application/vnd.ms-works
- application/vnd.oasis.opendocument.text
- application/vnd.oasis.opendocument.spreadsheet
- application/vnd.oasis.opendocument.presentation
- application/vnd.oasis.opendocument.graphics
- application/vnd.oasis.opendocument.chart
- application/vnd.oasis.opendocument.formula
- application/vnd.oasis.opendocument.database
- application/vnd.oasis.opendocument.image
- application/vnd.oasis.opendocument.text-master
- application/vnd.oasis.opendocument.text-template
- application/vnd.oasis.opendocument.spreadsheet-template
- application/vnd.oasis.opendocument.presentation-template
- application/vnd.oasis.opendocument.graphics-template
- application/vnd.oasis.opendocument.chart-template
- application/vnd.oasis.opendocument.formula-template
- application/vnd.oasis.opendocument.image-template
- application/vnd.oasis.opendocument.text-web
- application/vnd.sun.xml.calc
- application/vnd.sun.xml.calc.template
- application/vnd.sun.xml.draw
- application/vnd.sun.xml.draw.template
- application/vnd.sun.xml.impress
- application/vnd.sun.xml.impress.template
- application/vnd.sun.xml.math
- application/vnd.sun.xml.writer
- application/vnd.sun.xml.writer.global
- application/vnd.sun.xml.writer.template
- application/vnd.wordperfect
- application/x-abiword
- application/x-abiword-compressed
- application/x-dvi
- image/vnd.djvu

If you wish to remove a MIME mapping, you can do this by omitting the type -
for example to not index .doc files, use: --mime-type doc:

The lookup of extensions in the MIME mappings is case sensitive, but if an
extension isn't found and includes upper case ASCII letters, they're converted
to lower case and the lookup is repeated, so you effectively get case
insensitive lookup for mappings specified with a lower-case extension, but
you can set different handling for differently cased variants if you need
to.

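The 'ext:type' syntax and the two-step extension lookup can be modelled in
Python as below. This is a sketch of the behaviour described above, not
omindex's implementation: the function names are hypothetical, and representing
a removed mapping as an empty string is an assumption made for this example.

```python
def parse_mime_option(option):
    """Split an 'ext:type' argument; an empty type removes the mapping."""
    ext, sep, mime_type = option.partition(":")
    if not sep or not ext:
        raise ValueError("expected 'ext:type', got %r" % option)
    return ext, mime_type


def lookup_mime_type(filename, mime_map):
    """Return the MIME type for filename, or None if it should be skipped."""
    dot = filename.rfind(".")
    if dot == -1:
        return None
    ext = filename[dot + 1:]
    # First try the extension exactly as written (case sensitive).
    mime_type = mime_map.get(ext)
    if mime_type is None and ext != ext.lower():
        # Not found and it contains upper case letters: retry in lower case.
        mime_type = mime_map.get(ext.lower())
    # An empty string stands for a mapping removed with 'ext:'.
    return mime_type or None


mime_map = {"html": "text/html", "doc": "application/msword"}
ext, mime_type = parse_mime_option("doc:application/postscript")
mime_map[ext] = mime_type
print(lookup_mime_type("REPORT.HTML", mime_map))
print(lookup_mime_type("paper.doc", mime_map))
```

Adding an entry such as ``"DOC": ""`` to the map in this model would skip
files ending in ``.DOC`` while still indexing files ending in ``.doc``.
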
--duplicates configures how omindex handles duplicates (detected on
URL). 'ignore' means to ignore a document if it already appears to be
in the database; 'replace' means to replace the document in the
database with a new one by indexing this file, and 'duplicate' means
to index this file as a new document, leaving the previous one in the
database as well. The last strategy is very fast, but is liable to do
strange things to your result set. In general, 'ignore' is useful for
completely static documents (e.g. archive sites), while 'replace' is
the most generally useful.

With 'replace', omindex will remove any document it finds in the
database that it did not update - in other words, it will clear out
everything that doesn't exist any more. However if you are building up
an omega database with several runs of omindex, this is not
appropriate (as each run would delete the data from the previous run),
so you should use the --preserve-nonduplicates option. Note that if you
choose to work like this, it is impossible to prune old documents from
the database using omindex. If this is a problem for you, an
alternative is to index each subsite into a different database, and
merge all the databases together when searching.

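The interaction between the duplicate strategies and pruning can be illustrated
with a toy model in Python. The "database" here is just a list of (URL, text)
pairs and the function ``index_run`` is hypothetical; it only mirrors the
behaviour described above and says nothing about how omindex implements it.

```python
def index_run(db, files, duplicates="replace", preserve_nonduplicates=False):
    """Apply one simulated indexing run.

    db is a list of (url, text) pairs; files maps each URL seen in this
    run to its current text.
    """
    for url, text in files.items():
        present = any(u == url for u, _ in db)
        if present and duplicates == "ignore":
            continue  # keep the existing document untouched
        if present and duplicates == "replace":
            db[:] = [(u, t) for u, t in db if u != url]
        # 'duplicate' (or a new URL) simply adds another document.
        db.append((url, text))
    if duplicates == "replace" and not preserve_nonduplicates:
        # Prune everything that was not seen in this run.
        db[:] = [(u, t) for u, t in db if u in files]
    return db


db = [("/press/index.html", "press")]
index_run(db, {"/products/a.html": "product"}, preserve_nonduplicates=True)
print(db)
```

In this model, running the second pass without ``preserve_nonduplicates=True``
would drop the '/press' document, which is the situation the text warns about
when a database is built up from several runs.
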
--depth-limit allows you to prevent omindex from descending more than
a certain number of directories. If you wish to replicate the old
--no-recurse option, use --depth-limit=1.

HTML Parsing
============

The document ``<title>`` tag is used as the document title, the 'description'
META tag (if present) is used for the document snippet, and the 'keywords'
META tag (if present) is indexed as extra document text.

The HTML parser will look for the 'robots' META tag, and won't index pages
which are marked as ``noindex`` or ``none``, for example any of the following::

    <meta name="robots" content="noindex,nofollow">
    <meta name="robots" content="noindex">
    <meta name="robots" content="none">

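The robots check can be approximated with Python's standard ``html.parser``
module. This is an independent sketch, not Omega's HTML parser; the class and
function names are hypothetical, and treating the ``content`` attribute as a
comma-separated, case-insensitive list is an assumption.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a page asks not to be indexed."""

    def __init__(self):
        super().__init__()
        self.indexing_allowed = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        content = attributes.get("content", "")
        directives = {d.strip().lower() for d in content.split(",")}
        if directives & {"noindex", "none"}:
            self.indexing_allowed = False


def may_index(html):
    """Return False if a robots META tag contains 'noindex' or 'none'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.indexing_allowed


print(may_index('<meta name="robots" content="noindex,nofollow">'))
```
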
The parser also understands ht://dig comments to mark sections of the document
to not index (for example, you can use this to avoid indexing navigation links
or standard headers/footers) - for example::

    Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->

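The effect of these markers can be imitated with a regular expression. This is
a simplified Python sketch and not how Omega's parser works internally: the
function name is hypothetical, and a section without a closing marker is left
untouched here, which may differ from Omega's behaviour.

```python
import re

# Non-greedy match so that several marked sections are removed separately.
NOINDEX_SECTION = re.compile(
    r"<!--htdig_noindex-->.*?<!--/htdig_noindex-->", re.DOTALL
)


def strip_noindex_sections(html):
    """Remove every section wrapped in ht://dig noindex comments."""
    return NOINDEX_SECTION.sub("", html)


sample = "Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->"
print(repr(strip_noindex_sections(sample)))
```
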
Boolean terms
=============

omindex will create the following boolean terms when it indexes a
document:

T
    MIME type
H
    hostname of site (if supplied - this term won't exist if you index a
    site with base URL '/press', for instance)
P
    path of site (i.e. the rest of the site base URL)
U
    full URL of indexed document - if the resulting term would be > 240
    characters, a hashing scheme is used to prevent omindex overflowing
    the Xapian term length limit.
D
    date (numeric format: YYYYMMDD)

    date can also have the magical form "latest" - a document indexed
    by the term Dlatest matches any date-range without an end date.
    You can index dynamic documents which are always up to date
    with Dlatest and they'll match as expected. (If you use sort by date,
    you'll probably also want to set the value containing the timestamp to
    a "max" value so dynamic documents match a date in the far future.)
M
    month (numeric format: YYYYMM)
Y
    year (four digits)

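As a rough illustration, the terms listed above could be derived like this in
Python. This is a sketch under assumptions rather than omindex's code: the
function name ``boolean_terms`` is hypothetical, the hashing of long URL terms
is deliberately not modelled, and details such as trailing-slash or case
handling of the site path and hostname are not specified by the text.

```python
from datetime import date
from urllib.parse import urlsplit


def boolean_terms(base_url, doc_url, mime_type, modified):
    """Build the T, H, P, U, D, M and Y terms for one document."""
    parts = urlsplit(base_url)
    terms = ["T" + mime_type]
    if parts.hostname:
        # Only present when the base URL is absolute.
        terms.append("H" + parts.hostname)
    terms.append("P" + parts.path)
    # Note: URL terms longer than the limit would need hashing (not shown).
    terms.append("U" + doc_url)
    terms.append(modified.strftime("D%Y%m%d"))
    terms.append(modified.strftime("M%Y%m"))
    terms.append(modified.strftime("Y%Y"))
    return terms


terms = boolean_terms(
    "/press", "/press/index.html", "text/html", date(2009, 3, 7)
)
print(terms)
```
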
omega configuration
===================

Most of the omega CGI configuration is dynamic, by setting CGI
parameters. However some things must be configured using a
configuration file. The configuration file is searched for in
various locations:

- Firstly, if the "OMEGA_CONFIG_FILE" environment variable is
  set, its value is used as the full path to a configuration file
  to read.
- Next (if the environment variable is not set, or the file pointed
  to is not present), the file "omega.conf" in the same directory as
  the Omega CGI is used.
- Next (if neither of the previous steps found a file), the file
  "${sysconfdir}/omega.conf" (e.g. /etc/omega.conf on Linux systems)
  is used.
- Finally, if no configuration file is found, default values are used.

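The search order above can be expressed as a short Python function. This is a
model of the documented behaviour only: the function name and its parameters
(``cgi_dir``, ``sysconfdir``, ``environ``) are hypothetical, and "not present"
is interpreted here as "is not an existing regular file".

```python
import os


def find_config_file(cgi_dir, sysconfdir="/etc", environ=None):
    """Return the first configuration file found, or None to use defaults."""
    environ = os.environ if environ is None else environ
    candidates = []
    env_path = environ.get("OMEGA_CONFIG_FILE")
    if env_path:
        candidates.append(env_path)
    candidates.append(os.path.join(cgi_dir, "omega.conf"))
    candidates.append(os.path.join(sysconfdir, "omega.conf"))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None  # caller falls back to the built-in default values


print(find_config_file("/usr/lib/cgi-bin", environ={}))
```
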
The format of the file is very simple: a line per option, with the
option name followed by its value, separated by whitespace. Blank
lines are ignored. If the first non-whitespace character on a line
is a '#', omega treats the line as a comment and ignores it.

The current options are 'database_dir' (the directory containing all the
Omega databases), 'template_dir' (the directory containing the OmegaScript
templates), and 'log_dir' (the directory which the OmegaScript $log command
writes log files to).

The default values (used if no configuration file is found) are::

    database_dir /var/lib/omega/data
    template_dir /var/lib/omega/templates
    log_dir /var/log/omega

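A parser for this format fits in a few lines of Python. The sketch below is
not Omega's parser: ``parse_config`` is a hypothetical name, and it assumes
that options missing from the file keep their default values and that unknown
option names are simply stored.

```python
DEFAULTS = {
    "database_dir": "/var/lib/omega/data",
    "template_dir": "/var/lib/omega/templates",
    "log_dir": "/var/log/omega",
}


def parse_config(text):
    """Parse 'name value' lines, skipping blank lines and '#' comments."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split(None, 1)  # split on the first whitespace run
        if len(fields) == 2:
            name, value = fields
            config[name] = value
    return config


example = """
# Per-user Omega settings
database_dir /home/alice/omega/data
  # indented comments are ignored too
"""
print(parse_config(example))
```
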
Note that, with Apache, environment variables may be set using mod_env, and
with Apache 1.3.7 or later this may be used inside a .htaccess file. This
makes it reasonably easy to share a single system installed copy of Omega
between multiple users.

Supplied Templates
==================

The OmegaScript templates supplied with Omega are:

* query - This is the default template, providing a typical Web search
  interface.
* topterms - This is just like query, but provides a "top terms" feature
  which suggests terms the user might want to add to their query to
  obtain better results.
* godmode - Allows you to inspect a database showing which terms index
  each document, and which documents are indexed by each term.
* opensearch - Provides results in OpenSearch format (for more details
  see http://www.opensearch.org/).
* xml - Provides results in a custom XML format.

There are also "helper fragments" used by the templates above:

* inc/anyalldropbox - Provides a choice of matching "any" or "all" terms
  by default as a drop down box.
* inc/anyallradio - Provides a choice of matching "any" or "all" terms
  by default as radio buttons.
* toptermsjs - Provides some JavaScript used by the topterms template.

Document data construction
==========================

This is only useful if you need to inject your own documents into the
database independently of omindex. One example is indexing
dynamically-generated documents that are served using a server-side
system such as PHP or ASP, but whose contents you can determine
in some way, such as documents generated from reasonably static
database contents.

The document data field stores some summary information about the
document, in the following (sample) format::

    url=<baseurl>
    sample=<sample>
    caption=<title>
    type=<mimetype>

Further fields may be added (although omindex doesn't currently add any
others), and may be looked up from OmegaScript using the $field{}
command.

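If you generate such document data yourself, the format is easy to produce and
to read back. The following Python sketch shows one way to do both; the
function names are hypothetical, and it assumes one ``name=value`` pair per
line with values that contain no newlines.

```python
def make_document_data(fields):
    """Serialise (name, value) pairs as newline-separated 'name=value' lines."""
    return "\n".join("%s=%s" % (name, value) for name, value in fields)


def parse_document_data(data):
    """Read 'name=value' lines back into a dictionary."""
    fields = {}
    for line in data.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields


data = make_document_data([
    ("url", "/press/bigrelease.html"),
    ("sample", "We are pleased to announce..."),
    ("caption", "Big release"),
    ("type", "text/html"),
])
print(data)
print(parse_document_data(data)["caption"])
```
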
As of Omega 0.9.3, you can alternatively add something like this near the
start of your OmegaScript template::

    $set{fieldnames,$split{caption sample url}}

Then you need only give the field values in the document data, which can
save a lot of space in a large database. With the setting of fieldnames
above, the first line of document data can be accessed with $field{caption},
the second with $field{sample}, and the third with $field{url}.
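The positional form can be modelled the same way. In this Python sketch (not
Omega's implementation; ``parse_positional_data`` is a hypothetical name) each
line of document data is paired with the field name at the same position.

```python
def parse_positional_data(data, fieldnames):
    """Pair each line of document data with the field name at that position."""
    return dict(zip(fieldnames, data.split("\n")))


fieldnames = ["caption", "sample", "url"]
data = "Big release\nWe are pleased to announce...\n/press/bigrelease.html"
fields = parse_positional_data(data, fieldnames)
print(fields["url"])
```

Compared with the ``name=value`` form, the field names are stored once in the
template rather than in every document, which is where the space saving comes
from.
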