How can I add support for a new file format to Omega?

Filter programs which produce UTF-8 plain text on stdout

Omega 1.2.4 added the ability to specify additional external filters on the command line, provided they produce UTF-8 plain text on stdout. Many existing filter programs fit within these limitations.

Omega 1.4.0 can handle output in other character sets, in HTML, and in a temporary file.

To specify an additional filter, use the --filter command-line option, for example:

--filter=application/octet-stream:'strings -n8'

This tells omindex to index files with content-type application/octet-stream by running:

strings -n8 /path/to/file
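To make the mapping from option value to command concrete, here is a minimal sketch of how a value of the form MIMETYPE:COMMAND could be split and turned into the command line shown above. The helper names parse_filter_spec() and build_command() are hypothetical and are not taken from the omindex source; the sketch only assumes that the mime-type is everything before the first colon.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not omindex's own code): split "MIMETYPE:COMMAND"
// at the first ':' into the mime-type and the filter command.
std::pair<std::string, std::string>
parse_filter_spec(const std::string& spec)
{
    std::string::size_type colon = spec.find(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 == spec.size())
        throw std::invalid_argument("expected MIMETYPE:COMMAND");
    return {spec.substr(0, colon), spec.substr(colon + 1)};
}

// Hypothetical helper: the filename is appended as the last argument.
// The caller is assumed to have shell-quoted the filename already.
std::string
build_command(const std::string& command, const std::string& quoted_file)
{
    return command + " " + quoted_file;
}
```

With the example option above, the first helper would yield application/octet-stream and strings -n8, and the second would produce strings -n8 /path/to/file.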

If you find a reliable external filter and think it might be useful to other people, please let us know about it.

Other external filter programs

If the filter program doesn't produce plain text or HTML output, or you need to run several programs for each file, then you'll either need to put together a script which fits what omindex supports, or else modify the source code in index_file.cc (in 1.2.x, it was in omindex.cc). It's quite easy to wire in additional external filter programs, or to add support for XML-based formats.

Below we will attempt to guide you through doing so. If you need further help, please ask on the mailing lists.

Warning: The process of adding a new file format has been simplified in Omega 1.4.x, but this FAQ entry hasn't yet been fully updated to reflect this.

The first job is to find a good external filter, or decide to parse XML formats inside index_file.cc. Another viable approach would be to convert using an external library, but there aren't yet any examples of this amongst the supported formats.

Some formats have several filters to choose from. The attributes which interest us are reliably extracting the text with word breaks in the right places, and supporting Unicode (ideally as UTF-8). If you have several choices, try them on some sample files. It might be that outputting plain text directly isn't the best option - for example, for PostScript the only filters which produce plain text output we could find are limited to ISO-8859-1, so we currently convert PostScript to text via an intermediate PDF file.

The ideal (and simplest) case is that you have a filter which can produce UTF-8 output in plain text. It may need to be passed special command line options to do so, in which case work out what they are from the documentation or source code, and check that the output is indeed as documented.

We'll look at this simple case of UTF-8 plain text first (though note that as of Omega 1.2.4, this can be handled more simply by using the new --filter command line option), and then consider how to handle another encoding and/or output format.

You need a mime-type for your new format. The official registry is at https://www.iana.org/assignments/media-types/ but not all filetypes have a corresponding official mime-type. In this case, a de-facto standard "x-" prefixed mime-type often exists. A good way to look for one is to ask the file utility to identify a file (Omega uses the same library as file to identify files when it doesn't recognise the extension):

file --mime-type example.patch

which reports:

example.patch: text/x-diff

So text/x-diff is probably a good choice here.

Sometimes file just returns a generic answer (most commonly text/plain or application/octet-stream) and occasionally it misidentifies a file. In these cases, you can try looking for the extension in /etc/mime.types on a modern Unix box. So for Python scripts which have extension py:

grep -w py /etc/mime.types

which reports:

text/x-python					py

So text/x-python is probably a good choice here.

You need to add an entry for this to the file mimemap.tokens:

py      text/x-python

If multiple extensions are used for a format (such as htm and html for HTML) then add an entry for each. When indexing a filename which has an extension in upper- or mixed-case, omindex will check for an exact match for the extension in mime_map, and if not found, it will force the extension to lower-case and try again, so just add the extension in lower-case unless different cases actually have different meanings.
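The lookup order described above (exact match first, then a lower-cased retry) can be sketched as follows. This is only an illustration under stated assumptions: lookup_mimetype() is a hypothetical name, and a std::map stands in for whatever structure omindex really builds from mimemap.tokens.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical lookup (not omindex's own code): try the extension exactly
// as given, then retry with it forced to lower-case.
// Returns an empty string if neither form is known.
std::string
lookup_mimetype(const std::map<std::string, std::string>& mime_map,
                std::string ext)
{
    auto it = mime_map.find(ext);
    if (it != mime_map.end()) return it->second;
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    it = mime_map.find(ext);
    return it != mime_map.end() ? it->second : std::string();
}
```

Under this scheme a single lower-case entry for py covers PY and Py as well, which is why only the lower-case form normally needs adding.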

Then add a test for the new mime-type to the long if/else-if chain in index_file.cc (omindex.cc in 1.2.x). New formats should generally go at the end, unless they're very common, but don't fret about the order too much.

    } else if (mimetype == "text/x-python") { 

The filename of the file is in file. The code you add should at least extract the "body" text of the document into the C++ variable dump. Optionally, you can also set title (the document's title), keywords (additional text to index, but not to show the user), and sample (if set, this is used to generate the static document "snippet" which is stored; if not set, this is generated from dump).

It's most efficient if the filter program can write to stdout, but output to a temporary file can work too. For the stdout case, you would write:

        string cmd = "python2text --utf8 --stdout " + shell_protect(file);
        try {
            dump = stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }

The shell_protect() function escapes shell meta-characters in the filename. The stdout_to_string() function runs a command and captures its output as a C++ std::string. If the command isn't installed on PATH, omindex detects this automatically and disables support for the mimetype in the current run, so it will only try the first file of each such type.
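For readers who want a feel for what these two helpers do, here is a self-contained sketch of the same idea. It is not Omega's implementation: quote_for_shell() and capture_stdout() are hypothetical stand-ins, the quoting strategy (single quotes) is our own choice for the illustration, and the capture relies on POSIX popen().

```cpp
#include <stdexcept>
#include <stdio.h>   // popen() and pclose() are POSIX, declared here
#include <string>

// Hypothetical stand-in for shell_protect(): wrap the argument in single
// quotes, and write each embedded single quote as '\'' so that the shell
// sees the argument literally.
std::string
quote_for_shell(const std::string& arg)
{
    std::string out = "'";
    for (char ch : arg) {
        if (ch == '\'') out += "'\\''";
        else out += ch;
    }
    out += '\'';
    return out;
}

// Hypothetical stand-in for stdout_to_string(): run cmd via the shell and
// return everything it writes to stdout. Throws if the command can't be
// started or exits with a non-zero status.
std::string
capture_stdout(const std::string& cmd)
{
    FILE* fh = popen(cmd.c_str(), "r");
    if (!fh) throw std::runtime_error("popen failed");
    std::string out;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fh)) > 0)
        out.append(buf, n);
    if (pclose(fh) != 0) throw std::runtime_error("command failed");
    return out;
}
```

The point of quoting is that a filename such as it's $HOME reaches the filter unchanged rather than being interpreted by the shell.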

If the filter can only produce a temporary file, then you would write:

        if (!ensure_tmpdir()) {
            cout << "Couldn't create temporary directory (" << strerror(errno) << ") - skipping" << endl;
            return;
        }
        string tmpfile = tmpdir + "/tmp.txt";
        string safetmp = shell_protect(tmpfile);
        string cmd = "python2text --utf8 --output=" + safetmp + " " + shell_protect(file);
        try {
            (void)stdout_to_string(cmd);
            dump = file_to_string(tmpfile);
            unlink(tmpfile.c_str());
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            unlink(tmpfile.c_str());
            return;
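        } catch (...) {
            // Remove the temporary file, then rethrow so that an unexpected
            // error isn't silently swallowed.
            unlink(tmpfile.c_str());
            throw;
        }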
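The example above has to call unlink() on every exit path. One alternative, shown here purely as a sketch and not as something the Omega source does, is a small RAII guard which removes the file when it goes out of scope. The class name TempFileRemover is hypothetical.

```cpp
#include <string>
#include <unistd.h>   // unlink() (POSIX)

// Hypothetical RAII guard: removes the named file when the guard is
// destroyed, whether the scope is left normally, by return, or by an
// exception.
class TempFileRemover {
    std::string path;

  public:
    explicit TempFileRemover(const std::string& path_) : path(path_) {}

    ~TempFileRemover() { unlink(path.c_str()); }

    // Non-copyable, so the file is only ever removed once.
    TempFileRemover(const TempFileRemover&) = delete;
    TempFileRemover& operator=(const TempFileRemover&) = delete;
};
```

Declaring such a guard straight after computing tmpfile would let the try block shrink to the two calls which do the real work.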

If UTF-8 output isn't supported, pick the best (or only!) supported encoding and then convert the output to UTF-8. To do this, once you have dump, convert it like so (replacing "ISO-8859-1" with the character set which is produced):

        convert_to_utf8(dump, "ISO-8859-1");

The character set name is passed to iconv() if that's available; otherwise the conversion is done by a simple internal conversion library which understands the most common encodings. If the output is purely 7-bit ASCII, there's no need to convert as ASCII is a subset of UTF-8 (so ASCII text is valid UTF-8 already).
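To show what such a conversion involves in the simplest case, here is a sketch for ISO-8859-1 only. It is not the code behind convert_to_utf8(); the function name latin1_to_utf8() is hypothetical, and it returns a new string rather than converting in place. It relies on the fact that each ISO-8859-1 byte value equals the Unicode code point of the character it encodes.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 text to UTF-8.
// Bytes below 0x80 are ASCII and are copied unchanged; bytes 0x80-0xFF map
// to code points U+0080-U+00FF, which UTF-8 encodes as two bytes.
std::string
latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char ch : in) {
        if (ch < 0x80) {
            out += static_cast<char>(ch);
        } else {
            out += static_cast<char>(0xC0 | (ch >> 6));
            out += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return out;
}
```

Pure ASCII input comes back byte-for-byte identical, which is the "no need to convert" case mentioned above.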

If plain text isn't available, HTML output is easy to use (see text/rtf for an example of how to do this). Conversion using multiple filters is also possible (for an example, see application/postscript which uses ps2pdf followed by pdftotext to support Unicode PostScript files).

If your format is XML based, then you can probably subclass HtmlParser to pull out the contents of tags which are interesting. If you just need to strip tags, then XmlParser does exactly that. For a more complicated example, see MetaXmlParser which parses OpenDocument meta.xml files. If you want a complex example to study, look at the OpenDocument filtering which shows how to extract multiple files from a zip format archive, and parse them with different XML parsers to produce dump, title, keywords, and sample.
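As a rough picture of what "just stripping tags" means, here is a deliberately naive sketch. It is not XmlParser and doesn't use Omega's parser classes: strip_tags() is a hypothetical function which ignores entities, comments and CDATA sections, and which assumes every tag separates words (which is wrong for inline markup in the middle of a word).

```cpp
#include <string>

// Hypothetical, naive tag stripper: drops everything between '<' and '>'
// and puts a single space where a tag was, so that text from adjacent
// elements doesn't run together into one word.
std::string
strip_tags(const std::string& xml)
{
    std::string text;
    bool in_tag = false;
    for (char ch : xml) {
        if (ch == '<') {
            in_tag = true;
            if (!text.empty() && text.back() != ' ') text += ' ';
        } else if (ch == '>') {
            in_tag = false;
        } else if (!in_tag) {
            text += ch;
        }
    }
    // Drop the separator added for a trailing closing tag.
    if (!text.empty() && text.back() == ' ') text.pop_back();
    return text;
}
```

A real parser subclass is the better route for anything beyond experiments, since it also gives you hooks for pulling out titles and keywords from particular tags.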

Once you're happy your filter works, please submit a patch so we can include it in future releases (creating a new trac ticket and attaching the patch is best). Before doing so, please also update docs/overview.rst by:

  • Adding the format and extensions recognised for it to the list.
  • Adding the mime-type to the list.

It would be really useful if you are able to supply some sample files with a licence which allows redistribution so we can test the filtering. Ideally these would include non-ASCII characters so that we know Unicode support works.

Last modified on 04/03/19 18:49:14