Scriptindex Example

This document describes in detail how to generate a Xapian index using scriptindex. It was developed on Linux but should be portable to any other Unix-like OS. Porting to Windows may be difficult unless you have the Cygwin environment. You should have an HTTP server such as Apache running on the system, and a recent perl interpreter. This demo also requires that the Xapian libraries and Omega utilities be installed in their default locations. If they are not, you will have to adjust some of the example scripts and programs accordingly.

If you have installed xapian-omega as a binary from a package, such as rpm or deb, then the documentation for omega, scriptindex, et al will most likely be in /usr/share/doc/xapian-omega. If you installed from source, the files are in .../xapian-omega/docs where ... is the location you extracted the tar file to.

I chose perl to prepare the data for scriptindex for three reasons: it's readily available; if you don't get too crazy, you can write code that even non-perl programmers can read; and I'm fairly competent with the language.

First, download the sample files from http://fayettedigital.com/downloads/scriptindex_example.tar.gz. This file is about 7 MB and contains the data, scripts and other necessary files.

Untar the file into a convenient location and go to the scriptindex_example/ directory. The files in the txt directory are the ones we will be indexing. The files in the phil directory will be moved somewhere in your document root so we can display them via a web browser. I suggest placing the phil directory directly in the document root, like this:

sudo mv phil /var/www/html

In practice, you wouldn't want this duplication of files, but keeping the two copies separate makes the example easier to follow.

The first thing we must do is index these documents. To do this we'll have to make a directory in which to place the index. If we look at the /usr/local/etc/omega.conf (or perhaps /etc/omega.conf) file, we will see where omega expects to find the index files. Here are the contents of the default file:

database_dir /var/lib/xapian-omega/data
template_dir /var/lib/xapian-omega/templates
log_dir /var/log/xapian-omega
cdb_dir /var/lib/omega/xapian-cdb
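
If you would rather not read the file by eye, a small shell sketch like the following can pull out the database_dir setting. The function name omega_database_dir is made up for this example and is not part of Omega; it simply prints the second word of the first line whose first word is database_dir.

```shell
# Print the database_dir setting from an omega.conf-style file.
# omega_database_dir is a hypothetical helper name, not an Omega command.
omega_database_dir() {
    # $1 is the path to the config file; each line is "key value".
    awk '$1 == "database_dir" { print $2; exit }' "$1"
}
```

For example, omega_database_dir /etc/omega.conf would print /var/lib/xapian-omega/data for the default file shown above.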

Omega, which is what we'll be using to demonstrate later on, expects a database named default. Create a directory to hold the database files with the following command. We also want to change the ownership so we do not have to index the files as the root user. Change joeuser to your user name.

sudo mkdir -p /var/lib/xapian-omega/data/default
sudo chown joeuser /var/lib/xapian-omega/data/default
sudo chmod 755 /var/lib/xapian-omega/data/default

First let's talk about fields. With Xapian, we can break a document up into multiple fields. This then lets us restrict searches by asking for documents only containing search terms in a specific field. For instance we could ask for documents for a specific date, where the author's name field contains “John” and the title contains “spandex” and the body of the text (just another field, really) contains “ripples”. There is no practical limit to the number of fields but common sense dictates there will be a small number of interesting fields in any document set. Definition of the fields is entirely up to you. You only have to specify one field if you wish.

A special kind of field is the boolean field. It holds a value from a finite (and usually predictable) set. Using boolean fields to subdivide your data is more efficient than a general search. Examples could be a field for gender (M or F), a military rank, or an auto make (Ford, Fiat, Mercedes, etc.). When searching, you can easily limit the search to documents matching one boolean value or a set of them. For instance, you may want to search only articles written by women, or only documents that relate in some way to Generals, Admirals and Majors.

Since we are going to be using scriptindex to do the indexing, we do have to tell it about the data we are going to send it. We will specify two files to scriptindex, an index file with field information and a data file with the actual field data.

The index file we will use for this example is as follows:

url : field boolean=Q unique=Q
body : index
title: field=title index=S value=0
date: field=date date=yyyymmdd index=D value=1
author: field index=A value=2
translator: field index=XTRANS value=3

The first line describes the url field: it is a boolean field and will be unique. This field holds a reference to the document, so that when we search we can create a reference to the file to pass on to the browser. The keyword "field" in that line indicates that the value is stored in the database. The Q in boolean=Q is a term prefix. The single-letter prefixes are reserved by Omega and are described in detail in the termprefixes.html file found in the docs directory for Omega.

The second line describes the body field. This is the entire document; it will be indexed but not stored in the database, i.e. there is no "field" keyword.

The third line describes the field, title. The title will be stored and the index terms will be identified as belonging to the title.

The other lines describe the remaining fields, which are handled in much the same way as the title. More information about the format of this file can be found in scriptindex.html.

Now we're ready to index, but first let's look at the perl file we'll use to index the files, Oindex.pl. This perl script is expecting to find a list of file names on the command line.

foreach my $file (@ARGV) {
    open DB, $file;

Here we will process each file on the command line, in the array @ARGV, and open each. The first line of each of the text files is formatted like this:

DATE|AUTHOR|TRANSLATOR|TITLE

I added these lines for convenience. In real life, it may be a bit harder to extract the necessary fields from the document. The date field is bogus: I generated a random date just so we would have a date field to demonstrate with. I'm sure neither Plato nor Aristotle was around to write on those dates. The other fields are from the document and should be accurate. Here is a sample first line from one of the files:

20020516|Aristotle|A. S. L. Farquharson|ON THE MOTION OF ANIMALS

We'll split that line with the following perl code:

# $string now looks like DATE|AUTHOR|TRANSLATOR|TITLE
my ($date, $author, $tauthor, $title) = split(/\|/, $string);
$indexDocument .= "title=$title\n";
$indexDocument .= "author=$author\n";
$indexDocument .= "translator=$tauthor\n";
$indexDocument .= "date=$date\n";
$indexDocument .= "body=";

What we are doing here is starting to build the data file for scriptindex. The data file consists of a set of field definitions. Each definition is the name of the field, an = sign, and then the field data. If the data is longer than a single line, each continuation line must start with an = sign. Here is some sample data:

url=/phil/aristotle-categories-79.txt
title=CATEGORIES
author=Aristotle
translator=E. M. Edghill
date=20041204
body=350 BC
=CATEGORIES
=by Aristotle
=translated by E. M. Edghill
=1
=Things are said to be named 'equivocally' when, though they have a
=common name, the definition corresponding with the name differs for
=each. Thus, a real man and a figure in a picture can both lay claim to
=the name 'animal'; yet these are equivocally so named, for, though
...
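
The continuation rule can be sketched in a few lines of shell. This is only an illustration, not how Oindex.pl does it: format_field is a hypothetical helper name, and it assumes the field name contains nothing special to sed (plain names such as body or title are fine).

```shell
# Emit one scriptindex field definition on standard output.
# The first line of the value becomes "name=...", and every further
# line is prefixed with "=" so it is read as a continuation.
# format_field is a hypothetical helper; $1 is the field name, $2 the value.
format_field() {
    name=$1
    printf '%s\n' "$2" | sed -e "1s/^/${name}=/" -e '2,$s/^/=/'
}
```

Calling format_field title CATEGORIES prints title=CATEGORIES, while a three-line body value comes out as one body= line followed by two lines that each start with an = sign.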

Scriptindex will process multiple data sets (data from multiple files) but each has to be separated by a single blank line. The variable $indexDocument will contain data from all the files we index. Scriptindex reads this data from a file, so we must write it to a file and then call scriptindex to process it. The next few lines show that happening:

open OUT, ">/tmp/tmp.dmp";
print OUT $indexDocument;
close OUT;
# Execute the scriptindex program. If you have placed scriptindex or the
# database somewhere else, be sure to adjust the next line.
# If you have installed from source, scriptindex will be located in the
# /usr/local/bin directory.

my $cmd = "/usr/bin/scriptindex /var/lib/xapian-omega/data/default ./test.index /tmp/tmp.dmp";
print `$cmd`;

For those of you unfamiliar with perl, the backticks (`) execute the command held in the variable $cmd and return its output, which print then displays.
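
Since the location of scriptindex differs between a packaged install and a source install, it can help to probe for it before editing the command. Here is a minimal shell sketch; first_executable is a hypothetical helper name, and the candidate paths you pass it are up to you.

```shell
# Print the first argument that names an executable file.
# Returns non-zero if none of the candidates qualifies.
# first_executable is a hypothetical helper, not part of Xapian or Omega.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

For instance, first_executable /usr/bin/scriptindex /usr/local/bin/scriptindex prints whichever of the two paths exists, and the result can then be used in place of the hard-coded path in $cmd.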

To run this perl script with the file names on the command line, use the following commands at the shell prompt. Be sure you are not running as root.

cd scriptindex_example
find txt -print | xargs perl Oindex.pl

If it ran OK it should print out something like:

records (added, replaced, deleted) = (52, 0, 0)

Let's test it using Omega. If you've installed from a .deb or from recent sources, then omega is in the /usr/lib/cgi-bin/omega/ directory. Point your browser to http://localhost/cgi-bin/omega/omega, enter feast in the search box and press search. If you see a message to the effect that Omega can't open the default database, then there is a discrepancy between the omega.conf file and where we placed our database. First try:

ls /var/lib/xapian-omega/data/default

and be sure there are files there. Then check that its parent directory, /var/lib/xapian-omega/data, matches the database_dir setting in /etc/omega.conf (or /usr/local/etc/omega.conf).
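
The "are there files there" check can also be scripted. This is a small sketch in shell; dir_has_files is a hypothetical helper name, and it only tests that the directory exists and is not empty, not that it holds a valid Xapian database.

```shell
# Succeed if $1 is a directory containing at least one entry.
# dir_has_files is a hypothetical helper; it does not validate the database.
dir_has_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}
```

You could use it as dir_has_files /var/lib/xapian-omega/data/default || echo "index directory is missing or empty".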

This should give you a quick idea of how to use scriptindex to index your data. I've tried to avoid errors, but if you spot something that's not right, please let me know. Go to the http://fayettedigital.com web site and click on the contact menu item.

Jim.

Last modified on 23/04/17 15:22:53