How can I add support for a new file format to Omega?
Filter programs which produce UTF-8 plain text on stdout
Omega 1.2.4 added the ability to specify additional external filters on the command line, provided they produce UTF-8 plain text on stdout - many existing filter programs fit within these limitations.
Omega 1.4.0 extended this to handle filter output in other character sets, output as HTML, and output written to a temporary file.
To specify an additional filter, use the --filter command-line option, for example:
--filter=application/octet-stream:'strings -n8'
This tells omindex to index files with content-type application/octet-stream by running:
strings -n8 /path/to/file
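A complete omindex invocation using this option might look something like the following (the database path, URL prefix, and document tree here are just placeholders - substitute your own):
omindex --db /var/lib/omega/default --url / --filter=application/octet-stream:'strings -n8' /srv/www/docs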
If you find a reliable external filter and think it might be useful to other people, please let us know about it.
Other external filter programs
If the filter program doesn't produce plain text or HTML output, or you need to run several programs for each file, then you'll either need to put together a script which fits what omindex supports, or else modify the source code in index_file.cc (in 1.2.x, it was in omindex.cc).
It's quite easy to wire in additional external filter programs, or
to add support for XML-based formats.
Below we will attempt to guide you through doing so. If you need further help, please ask on the mailing lists.
Warning: The process of adding a new file format has been simplified in Omega 1.4.x, but this FAQ entry hasn't yet been fully updated to reflect this.
The first job is to find a good external filter, or decide to parse XML formats inside index_file.cc. Another viable approach would be to convert using an external library, but there aren't yet any examples of this amongst the currently supported formats.
Some formats have several filters to choose from. The attributes which interest us are reliable extraction of the text with word breaks in the right places, and support for Unicode (ideally as UTF-8). If you have several choices, try them on some sample files. It may be that outputting plain text directly isn't the best option - for example, the only PostScript filters we could find which produce plain text output are limited to ISO-8859-1, so we currently convert PostScript to text via an intermediate PDF file.
The ideal (and simplest) case is that you have a filter which can produce UTF-8 output in plain text. It may need to be passed special command line options to do so, in which case work out what they are from the documentation or source code, and check that the output is indeed as documented.
We'll look at this simple case of UTF-8 plain text first (though note that as of Omega 1.2.4, this can be handled more simply by using the new --filter command-line option), and then consider how to handle another encoding and/or output format.
You need a mime-type for your new format. The official registry for media types is at https://www.iana.org/assignments/media-types/ but not all filetypes have a corresponding official mime-type. In this case, a de-facto standard "x-" prefixed mime-type often exists. A good way to look for one is to ask the file utility to identify a file (Omega uses the same library as file to identify files when it doesn't recognise the extension):
file --mime-type example.patch
which reports:
example.patch: text/x-diff
So text/x-diff is probably a good choice here.
Sometimes file just returns a generic answer (most commonly text/plain or application/octet-stream) and occasionally it misidentifies a file. In these cases, you can try looking for the extension in /etc/mime.types on a modern Unix box. So for Python scripts, which have extension py:
grep -w py /etc/mime.types
which reports:
text/x-python py
So text/x-python is probably a good choice here.
You need to add an entry for this to the file mimemap.tokens:
py text/x-python
If multiple extensions are used for a format (such as htm and html for HTML) then add an entry for each. When indexing a filename which has an extension in upper- or mixed-case, omindex will check for an exact match for the extension in mime_map, and if not found, it will force the extension to lower-case and try again, so just add the extension in lower-case unless different cases actually have different meanings.
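In outline, that lookup behaves something like this simplified sketch (not the verbatim omindex code, just an illustration of the behaviour described above):
// Simplified sketch of the extension lookup described above.
map<string, string>::const_iterator mt = mime_map.find(ext);
if (mt == mime_map.end()) {
    // No exact match - force the extension to lower-case and retry.
    for (string::size_type i = 0; i != ext.size(); ++i)
        ext[i] = tolower(static_cast<unsigned char>(ext[i]));
    mt = mime_map.find(ext);
}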
Then add a test for the new mime-type to the long if/else-if chain. New formats should generally go at the end, unless they're very common, but don't fret about the order too much.
} else if (mimetype == "text/x-python") {
The filename of the file is in file. The code you add should at least extract the "body" text of the document into the C++ variable dump. Optionally, you can also set title (the document's title), keywords (additional text to index, but not to show the user), and sample (if set, this is used to generate the static document "snippet" which is stored; if not set, this is generated from dump).
It's most efficient if the filter program can write to stdout, but output to a temporary file works too. For the stdout case, you would write:
string cmd = "python2text --utf8 --stdout " + shell_protect(file); try { dump = stdout_to_string(cmd); } catch (ReadError) { cout << "\"" << cmd << "\" failed - skipping\n"; return; }
The shell_protect() function escapes shell meta-characters in the filename. The stdout_to_string() function runs a command and captures its output as a C++ std::string. If the command isn't installed on PATH, omindex detects this automatically and disables support for the mimetype in the current run, so it will only try the first file of each such type.
If the filter can only produce a temporary file, then you would write:
if (!ensure_tmpdir()) {
    cout << "Couldn't create temporary directory ("
         << strerror(errno) << ") - skipping" << endl;
    return;
}
string tmpfile = tmpdir + "/tmp.txt";
string safetmp = shell_protect(tmpfile);
string cmd = "python2text --utf8 --output=" + safetmp + " " + shell_protect(file);
try {
    (void)stdout_to_string(cmd);
    dump = file_to_string(tmpfile);
    unlink(tmpfile.c_str());
} catch (ReadError) {
    cout << "\"" << cmd << "\" failed - skipping\n";
    unlink(tmpfile.c_str());
    return;
} catch (...) {
    unlink(tmpfile.c_str());
    throw;
}
If UTF-8 output isn't supported, pick the best (or only!) supported encoding and then convert the output to UTF-8 - to do this, once you have dump, convert it like so (replacing "ISO-8859-1" with the character set which is produced):
convert_to_utf8(dump, "ISO-8859-1");
The character set name is either passed to iconv(), or, if that's not available, handled by a simple internal conversion library which understands the most common encodings. If the output is purely 7-bit ASCII, there's no need to convert, as ASCII is a subset of UTF-8 (so ASCII text is valid UTF-8 already).
If plain text isn't available, HTML output is easy to use (see text/rtf for an example of how to do this). Conversion using multiple filters is also possible (for an example, see application/postscript, which uses ps2pdf followed by pdftotext to support Unicode PostScript files).
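As a rough sketch of such a two-stage pipeline, loosely modelled on the PostScript handling (this assumes the temporary directory has already been set up with ensure_tmpdir() as in the earlier example, and simplifies the error handling, so treat it as an outline rather than the actual omindex code):
// Convert to PDF first, then extract UTF-8 text from the PDF.
string tmppdf = tmpdir + "/tmp.pdf";
string cmd = "ps2pdf " + shell_protect(file) + " " + shell_protect(tmppdf);
try {
    (void)stdout_to_string(cmd);
    // "-" tells pdftotext to write the extracted text to stdout.
    cmd = "pdftotext -enc UTF-8 " + shell_protect(tmppdf) + " -";
    dump = stdout_to_string(cmd);
    unlink(tmppdf.c_str());
} catch (ReadError) {
    cout << "\"" << cmd << "\" failed - skipping\n";
    unlink(tmppdf.c_str());
    return;
}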
If your format is XML based, then you can probably subclass HtmlParser to pull out the contents of tags which are interesting. If you just need to strip tags, then XmlParser does exactly that.
For a more complicated example, see MetaXmlParser, which parses OpenDocument meta.xml files. If you want a complex example to study, look at the OpenDocument filtering, which shows how to extract multiple files from a zip format archive and parse them with different XML parsers to produce dump, title, keywords, and sample.
Once you're happy your filter works, please submit a patch so we can include it in future releases (creating a new trac ticket and attaching the patch is best). Before doing so, please also update docs/overview.rst by:
- Adding the format and extensions recognised for it to the list.
- Adding the mime-type to the list.
It would be really useful if you are able to supply some sample files with a licence which allows redistribution so we can test the filtering. Ideally ones with non-ASCII characters so that we know Unicode support works.