Opened 13 years ago

Last modified 20 months ago

#569 assigned defect

Generate omindex docs and code relating to file types

Reported by: Charles
Owned by: Olly Betts
Priority: low
Milestone: 2.0.0
Component: Omega
Version: git master
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

We should try to generate all the docs and code relating to file types from a common source to ensure they stay in step with one another.


Original description:

From the omindex man page:

-F, --filter=TYPE:CMD
    process files with MIME Content-Type TYPE using command CMD, which should produce UTF-8 text on stdout e.g. -Fapplication/octet-stream:'strings -n8'

This could be understood to mean that omindex examines files to determine their MIME type (I understood it that way) but from Olly's posting, subject "Re: [Xapian-discuss] Tika 0.8 failure rates", date 5oct11:

By default, omindex currently uses a list of extension->MIME content-type mappings, and only consults the magic library for extensions it doesn't know. So any file with a .doc extension will be considered as application/msword (unless you run omindex with '--mime-type=doc:').

A note about this could be added to the omindex man page and referenced from the -F and -M options descriptions.
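The lookup order described in that posting can be sketched as follows. This is only an illustration of the described behaviour, not omindex's actual code: the names `DEFAULT_MIME_MAP`, `apply_mime_type_option`, `guess_mime_type` and the `detect_by_content` callback (standing in for the magic library) are all hypothetical, and the three mappings are just a sample.

```python
import os

# Hypothetical sample of built-in extension -> MIME Content-Type mappings.
DEFAULT_MIME_MAP = {
    "doc": "application/msword",
    "txt": "text/plain",
    "html": "text/html",
}


def apply_mime_type_option(mime_map, arg):
    """Apply one EXT:TYPE argument; an empty TYPE removes the mapping for EXT."""
    ext, _, mime_type = arg.partition(":")
    if mime_type:
        mime_map[ext] = mime_type
    else:
        mime_map.pop(ext, None)


def guess_mime_type(path, mime_map, detect_by_content):
    """Use the extension mapping if there is one, else examine the contents."""
    ext = os.path.splitext(path)[1].lstrip(".")
    if ext in mime_map:
        return mime_map[ext]
    # Only reached for extensions with no mapping (or no extension at all).
    return detect_by_content(path)
```

With this sketch, a `.doc` file is always reported as application/msword until `apply_mime_type_option(mime_map, "doc:")` removes the mapping, after which the content detector is consulted instead.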

Attachments (1)

Omega MIME-extension table.odt (22.8 KB) - added by Charles 13 years ago.
Table of Omega extensions -> MIME types etc.

Change History (16)

comment:1 by Olly Betts, 13 years ago

Milestone: 1.3.0
Status: new → assigned
Summary: omindex man page -F text misleading → omindex --help text for -F misleading

The man page is generated from the --help output, so that's where it needs addressing.

comment:2 by Olly Betts, 13 years ago

Description: modified (diff)

Looking into this, I'm not really sure how to improve this - the option right above -F makes it pretty clear that the MIME type can come from the file extension:

  -M, --mime-type=EXT:TYPE map file extension EXT to MIME Content-Type TYPE
                           (empty TYPE removes any MIME mapping for EXT)

And if we add too much verbiage here, it stops being so useful for quick reference and ends up just duplicating the main documentation.

Thoughts?

comment:3 by Charles, 13 years ago

How about:

  • Changing "process files with MIME Content-Type TYPE using command CMD" to "process files with the extension corresponding to MIME Content-Type TYPE using command CMD"
  • Adding the http://xapian.org/docs/omega/overview.html link

It might be helpful if the list of MIME types on the Omega overview page was replaced with a table giving both the MIME types and the corresponding extensions. In case you want to do so, the attached OOo Writer document contains a table with columns showing

  1. Extension
  2. MIME type
  3. Common language type description
  4. Omega built-in
  5. Omega external
  6. Notes, including any external command for common extensions

by Charles, 13 years ago

Table of Omega extensions -> MIME types etc.

comment:4 by Olly Betts, 13 years ago

The amended version is less correct though - the MIME type may come from the extension or from examining the file's contents.

Thanks for the table, but really it needs to be a patch against overview.rst so we can sanely apply it. I'm also worried it would be a maintenance headache to keep it in step with reality - maybe it would be better to generate it from the list in the source code.

comment:5 by Charles, 13 years ago (in reply to: 3)

Ah, I had understood that omindex only derives MIME types by mapping from extensions. Was that once the case, and has it since changed?

If I understood exactly how omindex worked then I could be more helpful with suggestions. That is the issue: it is hard to find out how omindex handles extensions and MIME types without inspecting the source code, and that is a non-trivial task, not something many users are willing or able to do.

Keeping the help useful for quick reference is worthwhile but it would be helpful if the man page was a complete reference, or included a link to a complete reference.

How about having both the existing --help and a --help_verbose option? The man page could be generated from the --help_verbose source.

Regarding the maintenance headache, there's a note at the bottom of the table giving a command to extract what can be extracted from the source code; some further text would have to be added to complete the table as is. I would be willing to work on automating that as far as practicable.

comment:6 by Olly Betts, 13 years ago

Since 1.2.4 (so just over a year):

Optionally use libmagic to detect MIME types for files for which we have no extension mapping, which allows us to handle files with a misleading extension, or no extension at all. (ticket#114)

It looks like the documentation wasn't updated for this change though. I will sort that out.

I favour keeping --help output and the man pages aligned and fairly brief, and leaving more in depth documentation in the .rst docs. There's certainly an argument for having more in the man pages than --help output, but that would mean having three different levels of documentation for each program and I think that will inevitably lead to more omissions in keeping the docs up to date.

comment:7 by Olly Betts, 13 years ago

Using libmagic to detect MIME types documented in trunk r16230.

comment:8 by Olly Betts, 13 years ago

Milestone: 1.3.0 → 1.3.x

comment:9 by Olly Betts, 9 years ago

Milestone: 1.3.x → 1.3.4

comment:10 by Olly Betts, 9 years ago

Text from end of attachment:

Notes on sources of information for the table:
1. General information for the table came from the "omindex operation" section of http://xapian.org/docs/omega/overview.html.
2. File name extension to "MIME type" mappings came from the omindex source code.  The latest version can be found at https://gitorious.org/xapian/xapian/blobs/master/xapian-applications/omega/omindex.cc in the mime_map definition which started on line 1026 at the time of writing.

The command used to prepare data copied from that page was:

grep -E -v '^([0-9])*[[:space:]]*$|^[[:space:]]*//' | sed -e 's/[[:space:]]*mime_map\["//' -e 's/"] = "/\t/' -e 's/"; \/\/ /\t/' -e 's/";$//'

The tabs in the output allowed it to conveniently be converted into a table in OpenOffice Writer and then copied and pasted into the table above.
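For anyone wanting to repeat that extraction without grep and sed, a rough Python equivalent might look like the sketch below. It assumes source lines of the form `mime_map["ext"] = "type"; // note`, which is inferred from the sed patterns above rather than checked against omindex.cc, and the name `extract_mappings` is made up for this example. Unlike the pipeline, it drops every line that is not a mapping instead of passing it through.

```python
import re

# One mapping per line: mime_map["ext"] = "type"; optionally followed by // note
ENTRY = re.compile(
    r'^\s*mime_map\["([^"]+)"\]\s*=\s*"([^"]*)";\s*(?://\s*(.*))?$'
)


def extract_mappings(lines):
    """Return (extension, mime_type, note) tuples for each mapping line."""
    rows = []
    for line in lines:
        match = ENTRY.match(line)
        if match:
            # The trailing comment is optional, so default the note to "".
            rows.append((match.group(1), match.group(2), match.group(3) or ""))
    return rows


def to_tsv(rows):
    """Join the fields with tabs, ready to paste into a table."""
    return "\n".join("\t".join(row).rstrip("\t") for row in rows)
```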

I'm thinking it might be better to put the list of mappings into a new file and then generate both the C++ source and docs from that.

comment:11 by James Aylett, 9 years ago

I wonder if that should be a step towards having a system-level configuration file for default mappings, that can then be modified in a CLI invocation? (Would then require a new CLI option to list out the default mappings.)

comment:12 by Olly Betts, 9 years ago

Milestone: 1.3.4 → 1.4.x

[38189b9a81c5c4c47d56692660dbb749e83c00b4/git] added a file for the extension to mimetype mappings, which is used to generate code and docs. This means that these generated tables now exactly match the code (previously there was a risk they'd get out of step when a new type was added, or some handling changed).
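To illustrate the general idea of a single source feeding both outputs, here is a minimal sketch. It is not the generator from that commit: the file format (one `EXT TYPE` pair per line, `#` for comments) and the function names are assumptions made up for this example.

```python
def parse_mappings(text):
    """Parse 'EXT TYPE' lines, skipping blank lines and '#' comments."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        ext, mime_type = line.split(None, 1)
        pairs.append((ext, mime_type.strip()))
    return pairs


def to_cpp(pairs):
    """Emit C++ assignments in the style of the old mime_map definition."""
    return "\n".join(f'    mime_map["{ext}"] = "{mime}";' for ext, mime in pairs)


def to_rst(pairs):
    """Emit a reStructuredText bullet list for the documentation."""
    return "\n".join(f"* ``.{ext}``: {mime}" for ext, mime in pairs)
```

Because both `to_cpp` and `to_rst` consume the same parsed list, adding a line to the mapping file updates the code and the docs together.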

We can't currently generate exactly what's in the table in the attachment - I'm not sure that's what we should aim for, but it would be nice to at least be able to say what's built in, and what needs which external commands. Not sure what "Automatic" means though...

But that part seems trickier to achieve - the external commands which are handled via Filter objects (in current git master, see index_add_default_filters() in index_file.cc) could easily be pulled out into a text file we generated code and docs from, but types handled internally (e.g. application/x-abiword) and those via external commands which need more complex handling (e.g. application/postscript) are harder to extract in this way.

But it doesn't make sense to hold up 1.4.0 for the rest of this - these changes are essentially implementation details.

james said:

I wonder if that should be a step towards having a system-level configuration file for default mappings, that can then be modified in a CLI invocation? (Would then require a new CLI option to list out the default mappings.)

Currently I'm generating code from this file at compile-time, rather than parsing it at run-time.

We could probably load such data from a config file instead, though it needs thought about what's actually useful - system-wide setting of filters isn't necessarily what's wanted, and a system-wide file would have to be root-owned (since it specifies commands that get implicitly run by any user invoking omindex).

comment:13 by Olly Betts, 9 years ago

Description: modified (diff)
Summary: omindex --help text for -F misleading → Generate omindex docs and code relating to file types

I've amended the help text in [1c687618afcbc8e7163d3b8f15f0887c7cec71cc/git] and current master says (--filter has since gained support for character encodings other than UTF-8 and for HTML output):

  -M, --mime-type=EXT:TYPE  assume any file with extension EXT has MIME
                            Content-Type TYPE, instead of using libmagic
                            (empty TYPE removes any existing mapping for EXT)
  -F, --filter=M[,[T][,C]]:CMD
                            process files with MIME Content-Type M using
                            command CMD, which produces output (on stdout or
                            in a temporary file) with format T (Content-Type
                            or file extension; currently txt (default) or
                            html) in character encoding C (default: UTF-8).
                            E.g. -Fapplication/octet-stream:'strings -n8'
                            or -Ftext/x-foo,,utf-16:'foo2utf16 %f %t'

I think that deals with the original report, so retitling.
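For readers trying to decode the `M[,[T][,C]]:CMD` syntax, the sketch below splits such an argument into its parts using the defaults stated in the help text (txt and UTF-8). It is an illustration only and not how omindex parses the option; `parse_filter_option` is a hypothetical name, and the argument is assumed to have already had its shell quoting removed.

```python
def parse_filter_option(arg):
    """Split 'M[,[T][,C]]:CMD' into (mime_type, output_format, encoding, command)."""
    spec, sep, command = arg.partition(":")
    if not sep:
        raise ValueError("expected M[,[T][,C]]:CMD")
    fields = spec.split(",")
    if len(fields) > 3 or not fields[0]:
        raise ValueError("expected M[,[T][,C]] before the colon")
    mime_type = fields[0]
    # Empty or missing T and C fall back to the documented defaults.
    output_format = fields[1] if len(fields) > 1 and fields[1] else "txt"
    encoding = fields[2] if len(fields) > 2 and fields[2] else "UTF-8"
    return mime_type, output_format, encoding, command
```

So the second example from the help text, `text/x-foo,,utf-16:foo2utf16 %f %t`, leaves T empty (meaning txt) and sets only the character encoding.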

comment:14 by Olly Betts, 8 years ago

76bc9f770e654435be23fa9d2808815d46aad1c6 generates a table with columns for mime-type and extensions. I plan to backport that for 1.4.2.
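A table of that shape, with one row per MIME type and its extensions collected together, could be produced along these lines. This is a sketch under assumptions rather than the code from that commit: the function names are invented, and the output is a reStructuredText simple table chosen for brevity.

```python
def group_by_type(pairs):
    """Collect extensions under their MIME type, keeping first-seen order."""
    grouped = {}
    for ext, mime in pairs:
        grouped.setdefault(mime, []).append(ext)
    return grouped


def to_simple_table(grouped):
    """Render a two-column reStructuredText simple table."""
    header = ("MIME type", "Extensions")
    rows = [(mime, ", ".join("." + e for e in exts)) for mime, exts in grouped.items()]
    # Column widths must cover the header and every row.
    w0 = max(len(r[0]) for r in rows + [header])
    w1 = max(len(r[1]) for r in rows + [header])
    border = "=" * w0 + "  " + "=" * w1

    def fmt(row):
        return (row[0].ljust(w0) + "  " + row[1]).rstrip()

    return "\n".join([border, fmt(header), border] + [fmt(r) for r in rows] + [border])
```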

comment:15 by Olly Betts, 20 months ago

Milestone: 1.4.x → 2.0.0
Priority: normal → low
Version: 1.2.5 → git master

Reviewing this, the remaining thing raised here that isn't currently generated is the list starting:

  • HTML (.html, .htm, .shtml, .shtm, .xhtml, .xhtm)
  • PHP (.php) - our HTML parser knows to ignore PHP code
  • text files (.txt, .text)
  • SVG (.svg)
  • Compressed SVG (.svgz)
  • CSV (Comma-Separated Values) files (.csv)
  • PDF (.pdf) if pdftotext (comes with poppler or xpdf) or libpoppler (in particular libpoppler-glib-dev) are available
  • PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes with poppler or xpdf) or libpoppler (in particular libpoppler-glib-dev) are available

In the table in the .odt attachment, this information is in the same table that lists the extension to MIME content-type mappings. I worry though that the table ends up being too wide (in the .odt the "MIME type" column has had to be wrapped in many cases which makes it harder to read).

Also, where there are multiple extensions for a MIME content-type the .odt has a row for each, repeating the other fields - this helps keep the width under control, but makes it harder to see at a glance which extensions are essentially the same type. Perhaps using column spans for the other fields would work for this (grid tables support column spans in .rst).

Bumping the milestone as this isn't a blocker for the next release series.
