Ticket #290: overview.rst

File overview.rst, 18.8 KB (added by Frank J Bruzzaniti, 16 years ago)
I've updated the doco, not sure if it was meant to be a diff. If so let me know.

1	==============
2	Omega overview
3	==============
4
5	If you just want a very quick overview, you might prefer to read the
6	`quick-start guide <quickstart.html>`_.
7
8	Omega operates on a set of databases. Each database is created and updated
9	separately using either omindex or `scriptindex <scriptindex.html>`_. You can
10	search these databases (or any other Xapian database with suitable contents)
11	via a web front-end provided by omega, a CGI application. A search can also be
12	done over more than one database at once.
13
14	There are separate documents covering `CGI parameters <cgiparams.html>`_, the
15	`Term Prefixes <termprefixes.html>`_ which are conventionally used, and
16	`OmegaScript <omegascript.html>`_, the language used to define omega's web
17	interface. Omega ships with several OmegaScript templates and you can
18	use these, modify them, or just write your own. See the "Supplied Templates"
19	section below for details of the supplied templates.
20
21	Omega parses queries using the ``Xapian::QueryParser`` class - for the supported
22	syntax, see queryparser.html in the xapian-core documentation
23	- available online at: http://www.xapian.org/docs/queryparser.html
24
25	Term construction
26	=================
27
Documents within an omega database are stored with two types of terms:
those used for probabilistic searching (the CGI parameter 'P'), and
those used for boolean filtering (the CGI parameter 'B'). Boolean
terms start with an initial capital letter denoting the 'group' of the
term (e.g. 'T' for MIME type), while probabilistic terms are all
lower-case, and are also stemmed before being added to the
database.
35
The "english" stemmer is used by default - you can configure this for omindex
and scriptindex with "--stemmer LANGUAGE" (use 'none' to disable stemming; see
omindex --help for the list of accepted language names). At search time you
can configure the stemmer by adding $set{stemmer,LANGUAGE} to the top of your
OmegaScript template.
41
42	The two term types are used as follows when building the query:
43	B(oolean) terms with the same prefix are ORed together, with all the
44	different prefix groups being ANDed together. This is then FILTERed
45	against the P(robabilistic) terms. This will look something like::
46
                 [ FILTER ]
                  /      \
                 /        \
            P-terms      [ AND ]
                         /  | ... \
                        /
                    [ OR ]
                   /  | ... \
          B(F,1) B(F,2)...B(F,n)
56
57	Where B(F,1) is the first boolean term with prefix F, and so on.
58
59	The intent here is to allow filtering on arbitrary (and, typically,
60	orthogonal) characteristics of the document. For instance, by adding
61	boolean terms "Ttext/html", "Ttext/plain" and "P/press" you would be
62	filtering the probabilistic search for only documents that are both in
63	the "/press" site and which are either of MIME type text/html or
64	text/plain. (See below for more information about sites.)
65
66	If there is no probabilistic query, the boolean filter is promoted to
67	be the query, and the weighting scheme is set to boolean. This has
68	the effect of applying the boolean filter to the whole database.
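
The query structure described above can be sketched in a few lines of Python. This is only an illustration of the grouping rules, not omega's actual code: the function name ``build_query``, the tuple representation of the tree, and the assumption that a prefix is always the single leading capital letter are all made up for this example.

```python
from collections import defaultdict


def build_query(prob_terms, bool_terms):
    """Return a nested tuple describing the query tree (illustration only)."""
    # Group boolean terms by prefix. Assumption: the prefix is the single
    # leading capital letter, as in the examples in this document.
    groups = defaultdict(list)
    for term in bool_terms:
        groups[term[0]].append(term)

    # Terms sharing a prefix are ORed; the prefix groups are ANDed.
    bool_filter = ("AND", [("OR", groups[prefix]) for prefix in sorted(groups)])

    if not prob_terms:
        # No probabilistic query: the boolean filter becomes the query.
        return bool_filter
    if not groups:
        # No boolean terms: nothing to filter against.
        return ("PROB", list(prob_terms))
    return ("FILTER", ("PROB", list(prob_terms)), bool_filter)


tree = build_query(["search"], ["Ttext/html", "Ttext/plain", "P/press"])
```

Here ``tree`` has one OR group for the two ``T`` terms and one for the single ``P`` term, ANDed together and used to filter the probabilistic part.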
69
70	In order to add more boolean prefixes, you will need to alter the
71	``index_file()`` function in omindex.cc. Currently omindex adds several
72	useful ones, detailed below.
73
74	Probabilistic terms are constructed from the title, body and keywords
75	of a document. (Not all document types support all three areas of
76	text.) Title terms are stored with position data starting at 0, body
77	terms starting 100 beyond title terms, and keyword terms starting 100
78	beyond body terms. This allows queries using positional data without
79	causing false matches across the different types of term.
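
As a rough illustration of the position gaps, the following Python sketch assigns positions to three lists of words. The helper ``assign_positions`` is hypothetical, and the exact offsets (in particular whether the gap is counted from the last used position or the next free one) are an assumption; only the idea of a gap of 100 between areas comes from the text above.

```python
def assign_positions(title, body, keywords, gap=100):
    """Return (word, position) pairs with a gap between each area of text."""
    positions = []
    pos = 0  # title terms start at position 0
    for index, words in enumerate((title, body, keywords)):
        if index > 0:
            # Leave a gap so a phrase or proximity match cannot straddle
            # two different areas of the document.
            pos += gap
        for word in words:
            positions.append((word, pos))
            pos += 1
    return positions


result = assign_positions(["omega", "overview"], ["omega", "operates"], ["search"])
```

With these inputs the last title word and the first body word are 101 positions apart, so a phrase search for "overview omega" would not match across the title/body boundary.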
80
81	Sites
82	=====
83
84	Within a database, Omega supports multiple sites. These are recorded
85	using boolean terms (see 'Term construction', above) to allow
86	filtering on them.
87
All documents within a site share a common base
URL. For instance, you might have two sites, one for your press area
and one for your product descriptions:
91
92	- \http://example.com/press/index.html
93	- \http://example.com/press/bigrelease.html
94	- \http://example.com/products/bigproduct.html
95	- \http://example.com/products/littleproduct.html
96
97	You could index all documents within \http://example.com/press/ using a
98	site of '/press', and all within \http://example.com/products/ using
99	'/products'.
100
101	Sites are also useful because omindex indexes documents through the
102	file system, not by fetching from the web server. If you don't have a
103	URL to file system mapping which puts all documents under one
104	hierarchy, you'll need to index each separate section as a site.
105
106	An obvious example of this is the way that many web servers map URLs
107	of the form <\http://example.com/~<username>/> to a directory within
108	that user's home directory (such as ~<username>/pub on a Unix
109	system). In this case, you can index each user's home page separately,
110	as a site of the form '/~<username>'. You can then use boolean
111	filters to allow people to search only a specific home page (or a
112	group of them), or omit such terms to search everyone's pages.
113
Note that the site specified when you index is used to build the
complete URL that the results page links to. Thus while you will
typically want sites to be relative to the hostname part of the URL (e.g.
'/site' rather than '\http://example.com/site'), you can use absolute sites
to have a single search across several different hostnames. This will
still work if you actually store each distinct hostname in a different
database.
121
122	omindex operation
123	=================
124
125	omindex is fairly simple to use, for example::
126
  omindex --db default --url http://example.com/ /var/www/example.com
128
129	For a full list of command line options supported, see ``man omindex``
130	or ``omindex --help``.
131
132	You must specify the database to index into (it's created if it doesn't
133	exist, but parent directories must exist). You will often also want to specify
134	the base URL (which is used as the site, and can be relative to the hostname -
135	starts '/' - or absolute - starts with a scheme, e.g.
136	'\http://example.com/products/'). If not specified, the base URL defaults to
137	``/``.
138
You also need to tell omindex which directory to index. This should be
given either as a single directory (in which case it is taken to be the
directory base of the entire site being indexed), or as two arguments,
the first being the directory base of the site being indexed, and the
second being a relative directory within that to index.
144
145	For instance, in the example above, if you separate your products by
146	size, you might end up with:
147
148	- \http://example.com/press/index.html
149	- \http://example.com/press/bigrelease.html
150	- \http://example.com/products/large/bigproduct.html
151	- \http://example.com/products/small/littleproduct.html
152
153	If the entire website is stored in the file system under the directory
154	/www/example, then you would probably index the site in two
155	passes, one for the '/press' site and one for the '/products' site. You
156	might use the following commands::
157
  $ omindex -p --db /var/lib/omega/data/default --url /press /www/example/press
  $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products
160
If you add a new large product, but don't want to reindex the whole of
the products section, you could do::
163
  $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products large
165
166	and just the large products will be reindexed. You need to do it like that, and
167	not as::
168
  $ omindex -p --db /var/lib/omega/data/default --url /products/large /www/example/products/large
170
171	because that would make the large products part of a new site,
172	'/products/large', which is unlikely to be what you want, as large
173	products would no longer come up in a search of the products
174	site. (Note that the --depth-limit option may come in handy if you have
175	sites '/products' and '/products/large', or similar.)
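
To see why the two invocations differ, it may help to model how a document's URL could be derived from the site's base URL and directory base. The Python sketch below is an assumption about the mapping, not omindex's implementation, and ``document_url`` is a hypothetical helper.

```python
import posixpath


def document_url(base_url, base_dir, file_path):
    """Map a file below base_dir to a URL below base_url (illustration only)."""
    rel = posixpath.relpath(file_path, base_dir)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("file is not inside the directory base")
    return base_url.rstrip("/") + "/" + rel


path = "/www/example/products/large/bigproduct.html"

# Two-argument form: site '/products', subdirectory 'large' reindexed.
url_a = document_url("/products", "/www/example/products", path)

# Single-argument form with a different site, '/products/large'.
url_b = document_url("/products/large", "/www/example/products/large", path)
```

Both calls give the same URL, ``/products/large/bigproduct.html``, but in the second case the document belongs to the site '/products/large' rather than '/products', which is what changes the result of a search restricted to the products site.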
176
177	omindex has built-in support for indexing HTML, PHP, text files, and AbiWord
178	documents. It can also index a number of other formats using external
179	programs. Filter programs are run with CPU and memory limits to prevent a
180	runaway filter from blocking indexing of other files.
181
182	The following formats are currently supported (if you know of a reliable
183	filter which can extract text from another useful file format, please let us
184	know):
185
186	* HTML (.html, .htm, .shtml)
187	* PHP (.php) - our HTML parser knows to ignore PHP code
188	* text files (.txt, .text)
189	* PDF (.pdf) if pdftotext is available (comes with xpdf)
190	* PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes
191	with xpdf) are available
192	* OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm,
193	.sxw, .sxg, .stw) if unzip is available
194	* OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb,
195	.odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is
196	available
197	* MS Word documents (.docx) and (.doc, .dot) if antiword is available
198	* MS Excel documents (.xlsx) and (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)
* MS PowerPoint documents (.pptx) and (.ppt, .pps) if catppt is available (comes with catdoc)
* WordPerfect documents (.wpd) if wpd2text is available (comes with libwpd)
201	* MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
202	* AbiWord documents (.abw)
203	* Compressed AbiWord documents (.zabw) if gzip is available
204	* Rich Text Format documents (.rtf) if unrtf is available
205	* Perl POD documentation (.pl, .pm, .pod) if pod2text is available
206	* TeX DVI files (.dvi) if catdvi is available
207	* DjVu files (.djv, .djvu) if djvutxt is available
208
209	If you have additional extensions that represent one of these types, you need
210	to add an additional MIME mapping using the --mime-type option. For instance::
211
  $ omindex --db /var/lib/omega/data/default --url /press /www/example/press --mime-type doc:application/postscript
213
214	The syntax of --mime-type is 'ext:type', where ext is the extension of
215	a file of that type (everything after the last '.'), and type is one
216	of:
217
218	- text/html
219	- text/plain
220	- text/rtf
221	- text/x-perl
222	- application/msword
223	- application/pdf
224	- application/postscript
225	- application/vnd.ms-excel
226	- application/vnd.ms-powerpoint
227	- application/vnd.ms-works
228	- application/vnd.oasis.opendocument.text
229	- application/vnd.oasis.opendocument.spreadsheet
230	- application/vnd.oasis.opendocument.presentation
231	- application/vnd.oasis.opendocument.graphics
232	- application/vnd.oasis.opendocument.chart
233	- application/vnd.oasis.opendocument.formula
234	- application/vnd.oasis.opendocument.database
235	- application/vnd.oasis.opendocument.image
236	- application/vnd.oasis.opendocument.text-master
237	- application/vnd.oasis.opendocument.text-template
238	- application/vnd.oasis.opendocument.spreadsheet-template
239	- application/vnd.oasis.opendocument.presentation-template
240	- application/vnd.oasis.opendocument.graphics-template
241	- application/vnd.oasis.opendocument.chart-template
242	- application/vnd.oasis.opendocument.formula-template
243	- application/vnd.oasis.opendocument.image-template
244	- application/vnd.oasis.opendocument.text-web
245	- application/vnd.sun.xml.calc
246	- application/vnd.sun.xml.calc.template
247	- application/vnd.sun.xml.draw
248	- application/vnd.sun.xml.draw.template
249	- application/vnd.sun.xml.impress
250	- application/vnd.sun.xml.impress.template
251	- application/vnd.sun.xml.math
252	- application/vnd.sun.xml.writer
253	- application/vnd.sun.xml.writer.global
254	- application/vnd.sun.xml.writer.template
255	- application/vnd.wordperfect
256	- application/x-abiword
257	- application/x-abiword-compressed
258	- application/x-dvi
259	- image/vnd.djvu
260
261	If you wish to remove a MIME mapping, you can do this by omitting the type -
262	for example to not index .doc files, use: --mime-type doc:
263
264	The lookup of extensions in the MIME mappings is case sensitive, but if an
265	extension isn't found and includes upper case ASCII letters, they're converted
266	to lower case and the lookup is repeated, so you effectively get case
267	insensitive lookup for mappings specified with a lower-case extension, but
268	you can set different handling for differently cased variants if you need
269	to.
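
The lookup rule and the 'ext:type' syntax can be modelled with a small Python sketch. The function names and the use of a plain dictionary for the mappings are assumptions made for illustration; they are not taken from omindex.

```python
def ascii_lower(text):
    """Lower-case only the ASCII upper case letters in text."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def apply_mime_option(mime_map, option):
    """Apply one 'ext:type' option; an empty type removes the mapping."""
    ext, _, mime_type = option.partition(":")
    if mime_type:
        mime_map[ext] = mime_type
    else:
        mime_map.pop(ext, None)


def lookup_mime_type(mime_map, ext):
    """Case sensitive lookup first, then retry with ASCII lower case."""
    if ext in mime_map:
        return mime_map[ext]
    lowered = ascii_lower(ext)
    if lowered != ext:
        return mime_map.get(lowered)
    return None


mime_map = {"html": "text/html", "HTML": "text/x-upper-example"}
apply_mime_option(mime_map, "doc:application/postscript")
```

In this sketch ``lookup_mime_type(mime_map, "Html")`` falls back to the ``html`` entry, while ``"HTML"`` hits its own (made-up) entry directly.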
270
271	--duplicates configures how omindex handles duplicates (detected on
272	URL). 'ignore' means to ignore a document if it already appears to be
273	in the database; 'replace' means to replace the document in the
274	database with a new one by indexing this file, and 'duplicate' means
275	to index this file as a new document, leaving the previous one in the
276	database as well. The last strategy is very fast, but is liable to do
277	strange things to your results set. In general, 'ignore' is useful for
278	completely static documents (e.g. archive sites), while 'replace' is
279	the most generally useful.
280
With 'replace', omindex will remove any document it finds in the
database that it did not update - in other words, it will clear out
everything that doesn't exist any more. However, if you are building up
an omega database with several runs of omindex, this is not
appropriate (as each run would delete the data from the previous run),
so you should use the --preserve-nonduplicates option. Note that if you
choose to work like this, it is impossible to prune old documents from
the database using omindex. If this is a problem for you, an
alternative is to index each subsite into a different database, and
merge all the databases together when searching.
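
The interaction between the duplicate strategies and pruning can be modelled as follows. This Python sketch uses a dictionary from URL to a list of document contents as a stand-in for the database; that representation and the function ``index_run`` are purely hypothetical, and the sketch only mirrors the behaviour described above.

```python
def index_run(db, files, duplicates="replace", preserve_nonduplicates=False):
    """Simulate one omindex run over files (a dict of URL -> content)."""
    if duplicates not in ("ignore", "replace", "duplicate"):
        raise ValueError("unknown duplicates strategy: " + duplicates)
    seen = set()
    for url, content in files.items():
        seen.add(url)
        if url not in db:
            db[url] = [content]
        elif duplicates == "replace":
            db[url] = [content]          # new document replaces the old one
        elif duplicates == "duplicate":
            db[url].append(content)      # old document is left in place
        # 'ignore': the existing document is kept untouched
    if duplicates == "replace" and not preserve_nonduplicates:
        # Clear out everything this run did not update.
        for url in list(db):
            if url not in seen:
                del db[url]
    return db


pruned = index_run({"/press/a": ["old"], "/products/b": ["x"]}, {"/press/a": "new"})
kept = index_run({"/press/a": ["old"], "/products/b": ["x"]}, {"/press/a": "new"},
                 preserve_nonduplicates=True)
```

``pruned`` loses the '/products/b' entry because the run only covered '/press', which is exactly the situation --preserve-nonduplicates is meant to avoid; ``kept`` retains it.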
291
--depth-limit allows you to prevent omindex from descending more than
a certain number of directories. If you wish to replicate the old
--no-recurse option, use --depth-limit=1.
295
296	HTML Parsing
297	============
298
299	The document ``<title>`` tag is used as the document title, the 'description'
300	META tag (if present) is used for the document snippet, and the 'keywords'
301	META tag (if present) is indexed as extra document text.
302
303	The HTML parser will look for the 'robots' META tag, and won't index pages
304	which are marked as ``noindex`` or ``none``, for example any of the following::
305
  <meta name="robots" content="noindex,nofollow">
  <meta name="robots" content="noindex">
  <meta name="robots" content="none">
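
A check along these lines can be sketched with Python's standard ``html.parser`` module. This is not omega's HTML parser; the class and function names are hypothetical, and details such as case-insensitive matching of the directive names are assumptions.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a 'robots' META tag forbids indexing."""

    def __init__(self):
        super().__init__()
        self.indexable = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        if directives & {"noindex", "none"}:
            self.indexable = False


def is_indexable(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.indexable
```

For example, ``is_indexable('<meta name="robots" content="noindex,nofollow">')`` is false, while a page with no 'robots' META tag is treated as indexable.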
309
The parser also understands ht://dig comments to mark sections of the document
to not index (for example, you can use this to avoid indexing navigation links
or standard headers/footers) - for example::
313
  Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->
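
The effect of these markers can be approximated with a regular expression. The Python sketch below assumes every section is closed with ``<!--/htdig_noindex-->`` and does not handle unterminated sections; ``strip_noindex`` is a hypothetical helper, not part of omega.

```python
import re

# Non-greedy match so that each marked section is removed separately.
NOINDEX_RE = re.compile(r"<!--htdig_noindex-->.*?<!--/htdig_noindex-->", re.DOTALL)


def strip_noindex(html):
    """Remove the sections marked as not to be indexed."""
    return NOINDEX_RE.sub("", html)


text = strip_noindex(
    "Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->"
)
```

After this call ``text`` only contains the part before the marked section.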
315
316	Boolean terms
317	=============
318
319	omindex will create the following boolean terms when it indexes a
320	document:
321
T
    MIME type
H
    hostname of site (if supplied - this term won't exist if you index a
    site with base URL '/press', for instance)
P
    path of site (i.e. the rest of the site base URL)
U
    full URL of indexed document - if the resulting term would be > 240
    characters, a hashing scheme is used to prevent omindex overflowing
    the Xapian term length limit.
D
    date (numeric format: YYYYMMDD)

    date can also have the magical form "latest" - a document indexed
    by the term Dlatest matches any date-range without an end date.
    You can index dynamic documents which are always up to date
    with Dlatest and they'll match as expected. (If you use sort by date,
    you'll probably also want to set the value containing the timestamp to
    a "max" value so dynamic documents match a date in the far future).
M
    month (numeric format: YYYYMM)
Y
    year (four digits)
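
As a small illustration of the three date-based terms, the following Python sketch builds them from a date. The helper ``date_terms`` is hypothetical; only the prefixes and the numeric formats come from the list above.

```python
from datetime import date


def date_terms(day):
    """Return the D, M and Y boolean terms for a date."""
    return [
        "D" + day.strftime("%Y%m%d"),  # date: YYYYMMDD
        "M" + day.strftime("%Y%m"),    # month: YYYYMM
        "Y" + day.strftime("%Y"),      # year: four digits
    ]


terms = date_terms(date(2008, 9, 3))
```

A document dated 3 September 2008 would therefore carry the terms ``D20080903``, ``M200809`` and ``Y2008``.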
348
349	omega configuration
350	===================
351
352	Most of the omega CGI configuration is dynamic, by setting CGI
353	parameters. However some things must be configured using a
354	configuration file. The configuration file is searched for in
355	various locations:
356
357	- Firstly, if the "OMEGA_CONFIG_FILE" environment variable is
358	set, its value is used as the full path to a configuration file
359	to read.
360	- Next (if the environment variable is not set, or the file pointed
361	to is not present), the file "omega.conf" in the same directory as
362	the Omega CGI is used.
363	- Next (if neither of the previous steps found a file), the file
364	"${sysconfdir}/omega.conf" (e.g. /etc/omega.conf on Linux systems)
365	is used.
366	- Finally, if no configuration file is found, default values are used.
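
The search order above can be written out as a short Python sketch. It is only a model of the steps listed: the function name, its parameters, and the injectable ``exists`` check (used so the logic can be tried without touching the file system) are assumptions for illustration.

```python
import os
import posixpath


def find_config_file(cgi_dir, sysconfdir="/etc", environ=None, exists=os.path.isfile):
    """Return the first configuration file found, or None to use defaults."""
    if environ is None:
        environ = os.environ
    candidates = []
    env_path = environ.get("OMEGA_CONFIG_FILE")
    if env_path:
        candidates.append(env_path)                               # 1. environment
    candidates.append(posixpath.join(cgi_dir, "omega.conf"))      # 2. next to the CGI
    candidates.append(posixpath.join(sysconfdir, "omega.conf"))   # 3. sysconfdir
    for path in candidates:
        if exists(path):
            return path
    return None                                                   # 4. default values
```

If the environment variable points at a file which is not present, the sketch simply moves on to the next candidate, matching the second step in the list.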
367
The format of the file is very simple: a line per option, with the
option name followed by its value, separated by whitespace. Blank
lines are ignored. If the first non-whitespace character on a line
is a '#', omega treats the line as a comment and ignores it.
372
373	The current options are 'database_dir' (the directory containing all the
374	Omega databases), 'template_dir' (the directory containing the OmegaScript
375	templates), and 'log_dir' (the directory which the OmegaScript $log command
376	writes log files to).
377
378	The default values (used if no configuration file is found) are::
379
  database_dir /var/lib/omega/data
  template_dir /var/lib/omega/templates
  log_dir /var/log/omega
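
A parser for this file format might look like the Python sketch below. It is a minimal model, not omega's code: ``parse_config`` is a hypothetical name, and falling back to the default for any option missing from the file is an assumption (the text above only says the defaults apply when no file is found).

```python
DEFAULTS = {
    "database_dir": "/var/lib/omega/data",
    "template_dir": "/var/lib/omega/templates",
    "log_dir": "/var/log/omega",
}


def parse_config(text):
    """Parse 'name value' lines, skipping blank lines and '#' comments."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            config[parts[0]] = parts[1]
    return config


config = parse_config("# local settings\n\ndatabase_dir /srv/omega/data\n")
```

Here ``config["database_dir"]`` is overridden while the other two options keep their default values.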
383
Note that, with Apache, environment variables may be set using mod_env, and
with Apache 1.3.7 or later this may be used inside a .htaccess file. This
makes it reasonably easy to share a single system installed copy of Omega
between multiple users.
388
389	Supplied Templates
390	==================
391
392	The OmegaScript templates supplied with Omega are:
393
394	* query - This is the default template, providing a typical Web search
395	interface.
396	* topterms - This is just like query, but provides a "top terms" feature
397	which suggests terms the user might want to add to their query to
398	obtain better results.
399	* godmode - Allows you to inspect a database showing which terms index
400	each document, and which documents are indexed by each term.
401	* opensearch - Provides results in OpenSearch format (for more details
402	see http://www.opensearch.org/).
403	* xml - Provides results in a custom XML format.
404
405	There are also "helper fragments" used by the templates above:
406
407	* inc/anyalldropbox - Provides a choice of matching "any" or "all" terms
408	by default as a drop down box.
409	* inc/anyallradio - Provides a choice of matching "any" or "all" terms
410	by default as radio buttons.
411	* toptermsjs - Provides some JavaScript used by the topterms template.
412
413	Document data construction
414	==========================
415
416	This is only useful if you need to inject your own documents into the
417	database independently of omindex, such as if you are indexing
418	dynamically-generated documents that are served using a server-side
419	system such as PHP or ASP, but which you can determine the contents of
420	in some way, such as documents generated from reasonably static
421	database contents.
422
423	The document data field stores some summary information about the
424	document, in the following (sample) format::
425
  url=<baseurl>
  sample=<sample>
  caption=<title>
  type=<mimetype>
430
431	Further fields may be added (although omindex doesn't currently add any
432	others), and may be looked up from OmegaScript using the $field{}
433	command.
434
435	As of Omega 0.9.3, you can alternatively add something like this near the
436	start of your OmegaScript template::
437
  $set{fieldnames,$split{caption sample url}}
439
440	Then you need only give the field values in the document data, which can
441	save a lot of space in a large database. With the setting of fieldnames
442	above, the first line of document data can be accessed with $field{caption},
443	the second with $field{sample}, and the third with $field{url}.
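
The two layouts of the document data can be compared with a short Python sketch. The function ``parse_document_data`` is hypothetical and only mimics what a field lookup would see; it is not how omega implements $field{}.

```python
def parse_document_data(data, fieldnames=None):
    """Return a dict of fields from the document data.

    With fieldnames, each line is a bare value named by its position;
    otherwise each line is expected to be of the form name=value.
    """
    lines = data.split("\n")
    if fieldnames:
        return dict(zip(fieldnames, lines))
    fields = {}
    for line in lines:
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields


named = parse_document_data("url=/press/index.html\nsample=Hello\ncaption=Press")
compact = parse_document_data("Press\nHello\n/press/index.html",
                              ["caption", "sample", "url"])
```

Both calls yield the same three fields; the compact form stores only the values, so the field names are not repeated in every document.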
