How can I add support for a new file format to Omega?

There isn't currently a way to configure additional filters without modifying the source code in omindex.cc. Ideally there would be a configuration file to allow this, but that's not implemented yet. It is, however, quite easy to wire in additional external filter programs, or to add support for XML-based formats.

Below we will attempt to guide you through doing so. If you need further help, please ask on the mailing lists.

The first job is to find a good external filter, or decide to parse XML formats inside omindex.cc. Another viable approach would be to convert using an external library, but none of the currently supported formats does this, so there are no examples to follow yet.

Some formats have several filters to choose from. We're most interested in extracting the text with word breaks in the right places, and in supporting Unicode (ideally as UTF-8). If you have several choices, try them on some sample files. It might be that outputting plain text directly isn't the best option - for example, for PostScript we could only find filters to support ISO-8859-1 with text output, so we currently convert PostScript to text via an intermediate PDF file.

The ideal (and simplest) case is that you have a filter which can produce UTF-8 output in plain text. It may need to be passed special command line options to do so, in which case work out what they are from the documentation or source code, and check that the output is indeed as documented.

We'll look at this simple case of UTF-8 plain text first, and then consider how to handle another encoding and/or output format.

You need a mime-type for your new format. The official registry is at http://www.iana.org/assignments/media-types/ but not all filetypes have a corresponding official mime-type. In this case, a de-facto standard "x-" prefixed mime-type often exists. A good way to look for one is to take the extension and look for it in /etc/mime.types on a modern Unix box. So for Python scripts, which have extension py:

    grep -w py /etc/mime.types

Which reports:

    text/x-python                   py

So text/x-python is probably a good choice here.

You need to add an entry for this to mime_map:

    mime_map["py"] = "text/x-python";

If multiple extensions are used for a format (such as htm and html for HTML), add an entry for each. If a filename has an extension in upper- or mixed-case, omindex will check for an exact match for the extension in mime_map, and if not found, it will force the extension to lower-case and try again, so just add the extension in lower-case unless different cases actually have different meanings.
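The lookup order described above can be sketched as follows. This is only an illustration: lookup_mime_type() is a hypothetical helper rather than a function in omindex.cc, and we assume mime_map behaves like a std::map from extension to mime-type.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper, not part of omindex.cc: returns the mime-type for an
// extension, or an empty string if the extension is unknown.
std::string lookup_mime_type(const std::map<std::string, std::string>& mime_map,
                             std::string ext)
{
    // First try the extension exactly as it appears in the filename.
    auto it = mime_map.find(ext);
    if (it == mime_map.end()) {
        // No exact match, so force the extension to lower-case and try again.
        std::transform(ext.begin(), ext.end(), ext.begin(),
                       [](unsigned char ch) {
                           return static_cast<char>(std::tolower(ch));
                       });
        it = mime_map.find(ext);
    }
    return it == mime_map.end() ? std::string() : it->second;
}
```

With only a "py" entry in the map, both "py" and "PY" resolve to the same mime-type, which is why a lower-case entry is normally enough.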

Then add a test for the new mime-type to the long if/else-if chain. New formats should generally go at the end, unless they're very common, but don't fret about the order too much.

    } else if (mimetype == "text/x-python") { 

The filename of the file is in file. The code you add should at least extract the "body" text of the document into the C++ variable dump. Optionally, you can also set title (the document's title), keywords (additional text to index, but not to show the user), and sample (if set, this is used to generate the static document "snippet" which is stored; if not set, this is generated from dump).

It's most efficient if the filter program can write to stdout, but output to a temporary file can work too. For the stdout case, you would write:

        string cmd = "python2text --utf8 --stdout " + shell_protect(file);
        try {
            dump = stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }

The shell_protect() function escapes shell meta-characters in the filename. The stdout_to_string() function runs a command and captures its output as a C++ std::string. If the command isn't installed on PATH, omindex detects this automatically and disables support for the mimetype in the current run, so it will only try the first file of each such type.
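If you want to try a filter's command line outside omindex first, the two helpers can be approximated as below. This is a rough sketch and not Omega's implementation: quote_for_shell() and capture_stdout() are hypothetical names, the quoting uses POSIX single quotes (which may differ from what shell_protect() actually does), and std::runtime_error stands in for ReadError. It assumes a POSIX system where popen() is available.

```cpp
#include <stdio.h>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for shell_protect(): wraps the argument in single
// quotes, closing and reopening them around any embedded single quote.
std::string quote_for_shell(const std::string& arg)
{
    std::string out = "'";
    for (char ch : arg) {
        if (ch == '\'')
            out += "'\\''";
        else
            out += ch;
    }
    out += '\'';
    return out;
}

// Hypothetical stand-in for stdout_to_string(): runs cmd via the shell and
// returns everything it wrote to stdout.
std::string capture_stdout(const std::string& cmd)
{
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("popen() failed");
    std::string out;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), pipe)) > 0)
        out.append(buf, n);
    // A non-zero status means the command failed or couldn't be run.
    if (pclose(pipe) != 0)
        throw std::runtime_error("command failed: " + cmd);
    return out;
}
```

Note that this sketch does nothing special for filenames starting with "-", which a filter might mistake for an option.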

If the filter can only produce a temporary file, then you would write:

        if (!ensure_tmpdir()) {
            cout << "Couldn't create temporary directory (" << strerror(errno) << ") - skipping" << endl;
            return;
        }
        string tmpfile = tmpdir + "/tmp.txt";
        string safetmp = shell_protect(tmpfile);
        string cmd = "python2text --utf8 --output=" + safetmp + " " + shell_protect(file);
        try {
            // The filter writes to tmpfile, so anything on stdout is discarded.
            (void)stdout_to_string(cmd);
            dump = file_to_string(tmpfile);
            unlink(tmpfile.c_str());
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            unlink(tmpfile.c_str());
            return;
        } catch (...) {
            // Clean up, then pass the exception on to the caller.
            unlink(tmpfile.c_str());
            throw;
        }
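In the code above, every exit path has to remember to call unlink() itself. An alternative, which isn't what omindex.cc does, is to let a small guard object remove the file when it goes out of scope. The sketch below is illustrative only: TmpFileRemover and slurp_and_remove() are hypothetical names, and std::runtime_error stands in for ReadError.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical helper, not part of omindex.cc: deletes the named file when
// the object goes out of scope, however the scope is left.
class TmpFileRemover {
    std::string path;
  public:
    explicit TmpFileRemover(const std::string& path_) : path(path_) { }
    ~TmpFileRemover() { unlink(path.c_str()); }
    // Non-copyable, so the file is only removed once.
    TmpFileRemover(const TmpFileRemover&) = delete;
    TmpFileRemover& operator=(const TmpFileRemover&) = delete;
};

// Hypothetical stand-in for the file_to_string() + unlink() pair: reads the
// whole file and removes it, even if opening or reading it throws.
std::string slurp_and_remove(const std::string& path)
{
    TmpFileRemover remover(path);
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("can't open " + path);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}
```

The trade-off is one extra class in exchange for not repeating the unlink() call in each catch block.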

If UTF-8 output isn't supported, pick the best (or only!) supported encoding and then convert the output to UTF-8 - to do this, once you have dump, convert it like so:

        convert_to_utf8(dump, "ISO-8859-1");

The character set name is either passed to iconv() or, if that's not available, handled by a simple internal conversion library which understands the most common encodings. If the output is purely ASCII, there's no need to convert as ASCII is a subset of UTF-8.
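To show what such a conversion involves, here is a minimal sketch for the ISO-8859-1 case only. It is not the convert_to_utf8() used by omindex; latin1_to_utf8() is a hypothetical name. It relies on the fact that each ISO-8859-1 byte has the same value as its Unicode code point, so bytes below 0x80 are copied unchanged and the rest become two-byte UTF-8 sequences.

```cpp
#include <string>

// Hypothetical helper, not Omega's convert_to_utf8(): converts ISO-8859-1
// text to UTF-8.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char ch : in) {
        if (ch < 0x80) {
            // ASCII is encoded identically in UTF-8.
            out += static_cast<char>(ch);
        } else {
            // Code points 0x80 to 0xFF need two bytes: 110000xx 10xxxxxx.
            out += static_cast<char>(0xC0 | (ch >> 6));
            out += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return out;
}
```

For example, the single byte 0xE9 (e with an acute accent) becomes the two bytes 0xC3 0xA9, while pure ASCII input comes back unchanged.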

If plain text isn't available, HTML output is easy to use (see text/rtf for an example of how to do this). Conversion using multiple filters is also possible (for an example, see application/postscript which uses ps2pdf followed by pdftotext to support Unicode PostScript files).

If your format is XML-based, then you can probably subclass HtmlParser to pull out the contents of tags which are interesting. If you just need to strip tags, then XmlParser does exactly that. For a more complicated example, see MetaXmlParser which parses OpenDocument meta.xml files. If you want a complex example to study, look at the OpenDocument filtering which shows how to extract multiple files from a zip-format archive, and parse them with different XML parsers to produce dump, title, keywords, and sample.
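As a rough idea of what "stripping tags" means, the sketch below drops everything between "<" and ">" and collapses whitespace. It is deliberately naive and is not how XmlParser works: strip_tags() is a hypothetical name, and it ignores entities, comments, CDATA sections and ">" inside attribute values. It also assumes every tag is a word break, which isn't true for inline markup in all formats.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not Omega's XmlParser: removes tags and collapses
// runs of tags and whitespace into a single space.
std::string strip_tags(const std::string& xml)
{
    std::string out;
    bool in_tag = false;
    bool pending_space = false;
    for (char ch : xml) {
        if (in_tag) {
            // Skip everything up to and including the closing '>'.
            if (ch == '>')
                in_tag = false;
            continue;
        }
        if (ch == '<') {
            in_tag = true;
            pending_space = true;
            continue;
        }
        if (std::isspace(static_cast<unsigned char>(ch))) {
            pending_space = true;
            continue;
        }
        // Emit at most one space between words, and none at the start.
        if (pending_space && !out.empty())
            out += ' ';
        pending_space = false;
        out += ch;
    }
    return out;
}
```

For real formats, subclassing the existing parsers is the better route, since they already deal with the cases this sketch skips.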

Once you're happy your filter works, please submit a patch so we can include it in future releases (creating a new trac ticket and attaching the patch is best). Before doing so, please also update docs/overview.rst by:

  • Adding the format and extensions recognised for it to the list.
  • Adding the mime-type to the list.

It would be really useful if you are able to supply some sample files with a licence which allows redistribution so we can test the filtering, ideally ones with non-ASCII characters so that we know Unicode support works.
