ScriptindexExample – Xapian

Scriptindex Example

This document describes in detail how to generate a Xapian index using scriptindex. It was developed on Linux but should be portable to any other Unix-like OS. Porting to a Windows platform may be difficult unless you have the Cygwin environment. You should have an HTTP server such as Apache running on the system and a recent Perl interpreter. This demo also requires that the Xapian libraries and Omega utilities be installed in their default locations. If they are not, you will have to adjust some of the script examples and programs accordingly. This has been tested on an Ubuntu 18.04 server and a Mint 20 desktop. It should work on almost any recent Debian-based system.

If you installed xapian-omega from a binary package, such as an rpm or deb, then the documentation for omega, scriptindex, and the related tools will most likely be in /usr/share/doc/xapian-omega. If you installed from source, the files are in .../xapian-omega/docs, where ... is the location where you extracted the tar file.

I chose Perl to prepare the data for scriptindex. It's readily available, I'm fairly competent with the language, and if you don't get too crazy, you can write code that even non-Perl programmers can read.

First, download the sample files from https://fayettedigital.com/downloads/scriptindex_example.tar.gz. This file is about 7 MB and contains the data, scripts, and other necessary files.

Untar the file into a convenient location and go to the scriptindex_example/ directory.

tar xf scriptindex_example.tar.gz 
cd scriptindex_example/
sudo cp -ra phil  /var/www/html/

The files in the txt directory are the ones we will be indexing. The files in the phil directory need to be copied somewhere under your document root so we can display them in a web browser. I suggest placing the phil directory directly in the document root, as shown above. Some systems might require that you change the owner/group of the phil directory to the owner specified in the configuration files for the httpd server you are using. That was not the case on the systems I tested. Since we will not be writing to the files in the phil directory, read-only access is all that is necessary.

The files in the phil directory are in HTML format, while the files in the txt directory are ASCII text files. They contain the same data in different formats. For example:

$ ls -l phil|head -5
total 11743
-rw-r--r-- 1 jwl jwl  84186 Jan  2  2009 aristotle-categories-79.html
-rw-r--r-- 1 jwl jwl 734892 Jan  2  2009 aristotle-history-78.html
-rw-r--r-- 1 jwl jwl 629509 Jan  2  2009 aristotle-metaphysics-77.html
-rw-r--r-- 1 jwl jwl 239779 Jan  2  2009 aristotle-meteorology-80.html
$ ls -l txt|head -5
total 11166
-rw-r--r-- 1 jwl jwl  82870 Apr 17  2006 aristotle-categories-79.txt
-rw-r--r-- 1 jwl jwl 725332 Apr 17  2006 aristotle-history-78.txt
-rw-r--r-- 1 jwl jwl 623596 Apr 17  2006 aristotle-metaphysics-77.txt
-rw-r--r-- 1 jwl jwl 236975 Apr 17  2006 aristotle-meteorology-80.txt

In practice, you wouldn't want to have this duplication of files, but keeping them separate makes the example easier to follow.

The first thing we must do is index these documents. To do this, we'll have to make a directory in which to place the index. The /usr/local/etc/omega.conf file (or perhaps /etc/omega.conf if you installed from the system repositories) shows where omega expects to find the index files. Here are the contents of the default file:

database_dir /var/lib/xapian-omega/data
template_dir /var/lib/xapian-omega/templates
log_dir /var/log/xapian-omega
cdb_dir /var/lib/omega/xapian-cdb

Note: On Mint 20, the omega.conf file has entries pointing to locations that don't exist. You may either edit the conf file and change the directory names to locations that do exist, or create the directories named in the conf file. Be sure to copy the query template and inc directory from xapian-omega-1.4.N/templates to your template directory.

Omega, which is what we'll be using to demonstrate later on, expects a database named default. Create a directory to hold the database files with the following commands. We also want to change the ownership so we do not have to index the files as the root user. Change joeuser to your user name.

sudo mkdir -p /var/lib/xapian-omega/data/default # Look in omega.conf for the path on your system. 
sudo chown joeuser /var/lib/xapian-omega/data/default
sudo chmod 755 /var/lib/xapian-omega/data/default
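
Rather than copying the path by hand, the database directory can be read out of omega.conf. The following is a minimal shell sketch, assuming the file uses the simple "name value" layout shown above. The helper name omega_database_dir is hypothetical, and the /etc/omega.conf path is an assumption that may need to be /usr/local/etc/omega.conf for a source install.

```shell
# Print the database_dir setting from an omega.conf-style file.
# Assumes one "name value" pair per line, separated by whitespace.
omega_database_dir() {
    awk '$1 == "database_dir" { print $2; exit }' "$1"
}

# Example use (the conf path is an assumption; adjust it for your system).
conf=/etc/omega.conf
if [ -r "$conf" ]; then
    echo "Omega expects the default database in: $(omega_database_dir "$conf")/default"
fi
```

The directory printed here is the one to pass to mkdir and chown in the commands above.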

First let's talk about fields. With Xapian, we can break a document up into multiple fields. This then lets us restrict searches by asking for documents only containing search terms in a specific field. For instance we could ask for documents for a specific date, where the author's name field contains “John” and the title contains “spandex” and the body of the text (just another field, really) contains “ripples”. There is no practical limit to the number of fields but common sense dictates there will be a small number of interesting fields in any document set. Definition of the fields is entirely up to you. You only have to specify one field if you wish.

Another special kind of field is the boolean field. This one is used to hold a finite (and usually predictable) set of data. The use of boolean fields to subdivide your data is more efficient than general searches. Examples of these fields could be a field for gender (M or F), or a military rank or an auto make (Ford, Fiat, Mercedes, etc). When searching you can easily limit your search to the documents with a single or even a set of boolean field specifications. For instance you may want to only search for articles written by women, or only documents that relate in some way to Generals, Admirals and Majors.

Since we are going to be using scriptindex to do the indexing, we do have to tell it about the data we are going to send it. We will specify two files to scriptindex, an index file with field information and a data file with the actual field data.

The index file we will use for this example (test.index) is as follows:

url : field boolean=Q unique=Q
body : index
title: field=title index=S value=0
date: field=date date=yyyymmdd index=D value=1
author: field index=A value=2
translator: field index=XTRANS value=3

The first line describes the url field, which is a boolean field and must be unique. This field holds a reference to the document, so that when we search we can build a link to the file to pass on to the browser. Because of unique=Q, indexing a record whose url is already in the database replaces the old record rather than adding a duplicate. The keyword "field" in that line indicates that the value is stored in the database. The Q in boolean=Q is a term prefix. The single-letter prefixes are reserved by Omega and described in detail in the termprefixes.html file found in the docs directory for Omega.

The second line describes the body field. This is the entire document. It will be indexed but not stored in the database, because the line has no "field" keyword.

The third line describes the title field. The title will be stored, and its index terms will be identified as belonging to the title (index=S).

The remaining lines follow the same pattern as the title line, each with its own term prefix and value slot. More information about the format of this file can be found in scriptindex.html.

Now we're ready to index, but first let's look at the Perl script we'll use to index the files, Oindex.pl. This script expects to find a list of file names on the command line.

foreach my $file (@ARGV) {
    open DB, $file;

Here we will process each file on the command line, in the array @ARGV, and open each. The first line of each of the text files is formatted like this:

DATE|AUTHOR|TRANSLATOR|TITLE

For convenience, I added these lines. In real life, it may be a bit harder to extract the necessary fields from the document. The date field is bogus. I generated a random date just so we would have a date field to demonstrate with. I'm sure neither Plato nor Aristotle was around to write on those dates. The other fields are from the document and should be accurate. Here is a sample first line from one of the files:

20020516|Aristotle|A. S. L. Farquharson|ON THE MOTION OF ANIMALS

We'll split that line with the following perl code:

# $string now looks like DATE|AUTHOR|TRANSLATOR|TITLE
my ($date, $author, $tauthor, $title) = split(/\|/, $string);
$indexDocument .= "title=$title\n";
$indexDocument .= "author=$author\n";
$indexDocument .= "translator=$tauthor\n";
$indexDocument .= "date=$date\n";
$indexDocument .= "body=";

What we are doing here is starting to build the data file for scriptindex. The data file consists of a set of field definitions. Each definition starts with the name of the field, followed by an = sign and then the field data. If the data is longer than a single line, each successive data line must start with an = sign. Here is some sample data:

url=/phil/aristotle-categories-79.txt
title=CATEGORIES
author=Aristotle
translator=E. M. Edghill
date=20041204
body=350 BC
=CATEGORIES
=by Aristotle
=translated by E. M. Edghill
=1
=Things are said to be named 'equivocally' when, though they have a
=common name, the definition corresponding with the name differs for
=each. Thus, a real man and a figure in a picture can both lay claim to
=the name 'animal'; yet these are equivocally so named, for, though
...
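
The same record layout can be produced without Perl. What follows is a minimal shell sketch of the idea, not the Oindex.pl script from the tarball. It assumes each text file begins with the DATE|AUTHOR|TRANSLATOR|TITLE line described above and ends with a newline. The helper name make_record is hypothetical, and the url value simply follows the sample data.

```shell
# Emit one scriptindex record for a text file whose first line is
# DATE|AUTHOR|TRANSLATOR|TITLE. make_record is a hypothetical helper name.
make_record() {
    file=$1
    # Split the header line on "|" into the four fields.
    IFS='|' read -r date author translator title < "$file"
    # The url follows the sample data: /phil/ plus the file's base name.
    printf 'url=/phil/%s\n' "$(basename "$file")"
    printf 'title=%s\n' "$title"
    printf 'author=%s\n' "$author"
    printf 'translator=%s\n' "$translator"
    printf 'date=%s\n' "$date"
    # Everything after the header is the body: the first line gets "body=",
    # every later line (including empty ones) gets a leading "=".
    tail -n +2 "$file" | sed -e '1s/^/body=/' -e '2,$s/^/=/'
    # A blank line ends the record.
    echo
}

# Example use (assumes the txt directory from the tarball):
#   for f in txt/*.txt; do make_record "$f"; done > /tmp/tmp.dmp
```

Prefixing empty body lines with = matters, because a truly blank line would be read as the end of the record.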

Scriptindex will process multiple data sets (data from multiple files) but each has to be separated by a single blank line. The variable $indexDocument will contain data from all the files we index. Scriptindex reads this data from a file, so we must write it to a file and then call scriptindex to process it. The next few lines show that happening:

open OUT, ">/tmp/tmp.dmp";
print OUT $indexDocument;
close OUT;
# Execute the scriptindex program. If you have placed scriptindex or the
# database somewhere else, be sure to adjust the next line.
# If you have installed from source, scriptindex might be located in the
# /usr/local/bin directory.

my $cmd ="/usr/bin/scriptindex /var/lib/xapian-omega/data/default ./test.index /tmp/tmp.dmp";
print `$cmd`;

For those of you unfamiliar with Perl, the backticks (`) run the command held in the variable $cmd and return its output, which print then displays.

To run this Perl script with the file names on its command line, use the following commands at the shell prompt. Be sure you are not running as root.

cd scriptindex_example
find txt -print | xargs perl Oindex.pl

If it ran OK it should print out something like:

records (added, replaced, deleted) = (52, 0, 0)
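
As a sanity check, the number of records in the dump file can be compared with the counts that scriptindex reports. This is a minimal shell sketch, assuming /tmp/tmp.dmp is still in place and that records are separated by blank lines as described above. The helper name count_records is hypothetical.

```shell
# Count the records in a scriptindex dump file.
# awk's paragraph mode (RS = "") treats runs of blank lines as record separators.
count_records() {
    awk 'BEGIN { RS = "" } END { print NR }' "$1"
}

# Example use:
#   count_records /tmp/tmp.dmp
```

If the number printed differs from the total of added and replaced records, a stray blank line in one of the body fields is a likely place to look first.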

Let's test it using Omega. If you've installed from a .deb or from recent sources, then omega is in the /usr/lib/cgi-bin/omega/ directory. However, in current releases of Apache the CGI module is not enabled by default, and the CGI file must have a .cgi extension. The changes I had to make, as root, were:

sudo a2enmod cgid
sudo cp /usr/lib/cgi-bin/omega/omega /usr/lib/cgi-bin/omega/omega.cgi
sudo systemctl restart apache2

After installing version 1.4.18 to /usr/local, I discovered that omega was in the /usr/local/lib/xapian-omega/bin directory, so you might have to look around for it.

Point your browser to http://localhost/cgi-bin/omega/omega.cgi, enter feast in the search box, and press search. If you see a message to the effect that Omega can't open the default database, then there is a discrepancy between the omega.conf file and where we placed our database. First try:

ls /var/lib/xapian-omega/data/default

and be sure there are files there. Then check that this directory is the database_dir entry from your omega.conf (/etc/omega.conf or /usr/local/etc/omega.conf) with /default appended.

This should give you a quick idea of how to use scriptindex to index your data. I've tried to avoid errors, but if you spot something that's not right, please let me know. Drop an email to jim at the website where you downloaded the files from.

Jim.

Last modified on 29/09/24 12:43:50
