Ticket #114 (assigned enhancement)

Opened 22 months ago

Last modified 3 weeks ago

Use libmagic or libextractor instead of own MIME mappings and extractions

Reported by: nemesis Owned by: olly
Priority: normal Milestone: 1.1.1
Component: Omega Version: SVN trunk
Severity: minor Keywords:
Cc: Blocked By:
Operating System: All Blocking:

Description (last modified by olly)

Hello,

I first modified my local copy of omindex to use libmagic's MIME database instead of hard-coding the file-extension-to-MIME-type mapping. This ensures that the internally used MIME types are more consistent with accepted standard types.
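For readers unfamiliar with the mapping being replaced, here is a minimal sketch of an extension-to-MIME-type lookup. The function name and the three table entries are illustrative assumptions, not omindex's actual table.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Return the MIME type for a filename based on its extension, or an empty
// string when the extension is missing or unknown.
// The table contents are an illustrative subset (assumption).
std::string mime_type_from_extension(const std::string& filename) {
    static const std::map<std::string, std::string> table = {
        {"txt", "text/plain"},
        {"html", "text/html"},
        {"pdf", "application/pdf"},
    };
    std::string::size_type dot = filename.rfind('.');
    if (dot == std::string::npos || dot + 1 == filename.size())
        return std::string();
    std::string ext = filename.substr(dot + 1);
    // Compare case-insensitively so "REPORT.PDF" matches "pdf".
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::map<std::string, std::string>::const_iterator it = table.find(ext);
    return it == table.end() ? std::string() : it->second;
}
```

An empty result means "skip this file", which is what lets an indexer avoid opening files it cannot handle.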

Then I went further and instead of using file extensions to determine type, used libmagic to fingerprint the files. This is slower, but ensures that the file actually is identified correctly even if the extension is wrong.

Now I am using libextractor to actually extract the metadata from the file, instead of calling these external programs inside omindex based on the MIME type. Using libextractor greatly simplifies omindex.

Is anyone interested in these modifications?

Attachments

libextractor.patch (4.9 kB) - added by nemesis 22 months ago.
patch to use libmagic and libextractor

Change History

Changed 22 months ago by olly

Yes, these sound interesting. I'd actually had a quick look at libextractor already and bookmarked it as "interesting", and I've also been wondering about using something like libmagic to decide what the encoding of text files is...

I think testing with libmagic probably should be optional. It seems a useful feature, but in many situations I'd certainly be happy to assume that the file extensions are correct in exchange for faster indexing.
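A rough sketch of how content-based detection could be made optional. The sniffing function here is a toy stand-in for libmagic that knows only two signatures, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Toy stand-in for a content-based detector such as libmagic (assumption):
// it recognises only two well-known file signatures.
std::string sniff_mime_type(const char* data, std::size_t len) {
    if (len >= 5 && std::memcmp(data, "%PDF-", 5) == 0) return "application/pdf";
    if (len >= 4 && std::memcmp(data, "GIF8", 4) == 0) return "image/gif";
    return std::string();
}

// Use the extension-based guess unless content checking is enabled and the
// content is positively identified as something else.
std::string detect_mime_type(const std::string& extension_guess,
                             const char* data, std::size_t len,
                             bool check_content) {
    if (check_content) {
        std::string sniffed = sniff_mime_type(data, len);
        if (!sniffed.empty()) return sniffed;
    }
    return extension_guess;
}
```

With check_content set to false the file content is never inspected, so the faster extension-only behaviour is preserved.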

Changed 22 months ago by nemesis

Using libextractor makes libmagic unnecessary: each libextractor plugin checks the file content itself and only processes files it recognises. There is currently one problem with using libextractor: the API gives no way to distinguish a file that was ignored from a file that had an empty keyword set. So omindex would be inefficient, since it would keep trying to extract from file types for which no extractor plugin exists. I have emailed the libextractor maintainers about that issue. In the meantime, perhaps it is a good idea to phase in optional libextractor support. I will see about writing the appropriate autoconf stuff to detect its presence and use it if the user enabled it.

Changed 22 months ago by nemesis

patch to use libmagic and libextractor

Changed 22 months ago by nemesis

This is a horrible hack, but you get the idea. A better setup would not bother with fileext/mimetypes that are known already to have no extractors available.

Changed 22 months ago by nemesis

Oh, you also need to check for -lmagic and -lextractor in the configure script. I hard coded them into the makefile for now.

Changed 22 months ago by olly

  • priority changed from highest to normal
  • rep_platform changed from PC to All
  • status changed from new to assigned

Doesn't look too bad!

It would be better to avoid stringstream which isn't available on some compilers we try to support - in this case it can just be replaced by appending directly to "dump", which is probably clearer and at least as efficient anyway.
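As an illustration of the suggestion, text can be accumulated by appending straight to a std::string. The name "dump" comes from the comment above; the keyword vector and the function name are assumptions about how the patch is shaped.

```cpp
#include <string>
#include <vector>

// Join extracted keywords into one text blob by appending directly to a
// std::string, with no std::stringstream involved.
std::string build_dump(const std::vector<std::string>& keywords) {
    std::string dump;
    for (std::vector<std::string>::const_iterator i = keywords.begin();
         i != keywords.end(); ++i) {
        if (!dump.empty()) dump += ' ';  // single space between keywords
        dump += *i;
    }
    return dump;
}
```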

This is quite a major structural change, so I think this is probably something to look at merging after 1.0 is out. My aim is to have that released at some point in April, so I'm trying to avoid changes with the potential to destabilise at the moment.

Does libmagic detect the character set of text files usefully? I played with GNU file and that is able to at least say "ASCII", "ISO-8859" or "UTF-8", which certainly beats assuming that all text files are UTF-8. If so, that part would certainly be useful for 1.0.

Changed 22 months ago by trac

  • platform set to All

Changed 22 months ago by nemesis

Yes, it can detect the charset. That's why I use strncmp() on the MIME string: otherwise something like text/plain; charset=us-ascii would be output, befuddling the check.
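A small sketch of the prefix comparison described here, plus a helper that pulls out the charset parameter. Both function names are hypothetical, and the input format is assumed to be "type/subtype; charset=name" as in the example above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// True when a MIME string such as "text/plain; charset=us-ascii" names the
// given bare type. strncmp() compares only the type prefix; the character
// after the prefix must end the type so "text/plainish" does not match.
bool mime_type_is(const char* mime, const char* type) {
    std::size_t n = std::strlen(type);
    if (std::strncmp(mime, type, n) != 0) return false;
    char next = mime[n];
    return next == '\0' || next == ';' || next == ' ';
}

// Return the value of the charset parameter, or an empty string if absent.
std::string mime_charset(const std::string& mime) {
    const std::string key = "charset=";
    std::string::size_type start = mime.find(key);
    if (start == std::string::npos) return std::string();
    start += key.size();
    std::string::size_type end = mime.find_first_of("; ", start);
    if (end == std::string::npos) return mime.substr(start);
    return mime.substr(start, end - start);
}
```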

Waiting for 1.0 will be fine, since it's unclear if libextractor upstream will move quickly on my request for the API change.

Changed 22 months ago by nemesis

I also noticed that I moved the md5 calculation block to the wrong place; it should be performed unconditionally.

Changed 22 months ago by nemesis

OK, I had a discussion with the libextractor maintainer. His objection to my idea is that a file extension or MIME type alone cannot tell you whether an extractor plugin will be able to handle a file (consider different versions of a file specification like PDF), so he doesn't want to change the API to do something that he feels is stupid. He did point out that libextractor only opens and mmaps the file before going through the plugins, so I/O is minimal. But since there is no way to tell whether libextractor is going to be able to handle the particular file we feed to it, we would always pay for the open, even when it turns out to be wasted.

I see one option besides taking this speed hit. (I believe forcing that hit upon the user would be contrary to the design of omindex, since avoiding it was the whole point of removing file extensions that index_file does not handle from the map.)

This would be to map MIME types directly to libextractor plugins.

The maintainer guarantees that the names of libextractor plugins are static. So we keep the filename-to-MIME-type map, which saves the open if the user doesn't want to use libmagic (libmagic being the option for more accurate MIME type identification).

Then we add a MIME-type-to-libextractor-plugin map, so that we check the MIME type of a file passed to index_file and call libextractor with an ExtractorList that includes only the plugin for that file's MIME type.
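The second map could look roughly like this. The plugin names in the table are placeholders chosen for illustration, not a verified list of libextractor plugins, and the function name is hypothetical.

```cpp
#include <map>
#include <string>

// Map a MIME type to the name of the single extractor plugin to load for it.
// An empty result means no plugin is known, so the file need not be opened.
// The plugin names below are illustrative placeholders (assumption).
std::string plugin_for_mime_type(const std::string& mime_type) {
    static const std::map<std::string, std::string> plugins = {
        {"application/pdf", "libextractor_pdf"},
        {"text/html", "libextractor_html"},
        {"audio/mpeg", "libextractor_mp3"},
    };
    std::map<std::string, std::string>::const_iterator it = plugins.find(mime_type);
    return it == plugins.end() ? std::string() : it->second;
}
```

Chaining this after the filename-to-MIME-type lookup gives a two-step decision made entirely from the filename, before any I/O on the file itself.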

Drawbacks:

- Requires a priori knowledge of what plugins libextractor currently has in order to add new ones, but it shouldn't change that frequently.

- If the file extension is wrong, mime_map is wrong, or libmagic screws up fingerprinting the file, we extract empty keyword set because the wrong libextractor plugin is called. But that is already the case with omindex because it currently depends on the file extension being correct.

So, I think it will work. If you think this approach works, I'll code it.

Changed 22 months ago by olly

It seems it would still be useful to know if libextractor failed to extract keywords, or if the file simply has no meta-data, since in the former case you might want to try running other "convert to text" libraries or utilities, whereas in the latter case that would probably be a waste of time.
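The distinction described here could be modelled as a three-way result. This is purely a hypothetical sketch: according to this ticket, the libextractor API offers no such information, and every name below is invented for illustration.

```cpp
// Hypothetical three-way outcome of an extraction attempt.
enum ExtractOutcome {
    NOT_HANDLED,          // extraction failed or nothing recognised the file
    HANDLED_NO_METADATA,  // the file was parsed but contains no keywords
    HANDLED_WITH_METADATA // the file was parsed and keywords were found
};

// Classify an attempt from two facts the caller would need to know.
ExtractOutcome classify_outcome(bool file_was_recognised, unsigned keyword_count) {
    if (!file_was_recognised) return NOT_HANDLED;
    return keyword_count == 0 ? HANDLED_NO_METADATA : HANDLED_WITH_METADATA;
}

// Other "convert to text" tools are only worth trying when extraction did
// not handle the file; a file with no meta-data will not gain from a retry.
bool should_try_other_converters(ExtractOutcome outcome) {
    return outcome == NOT_HANDLED;
}
```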

But I do see where the libextractor maintainer is coming from.

> Drawbacks: - Requires a priori knowledge of what plugins libextractor currently has in order to add new ones, but it shouldn't change that frequently.

Yeah, I don't see that as a huge issue. Essentially libextractor is a Swiss Army knife text extractor, and adding a new format it supports is conceptually similar to adding support for a new filter program or library (but less work!).

> - If the file extension is wrong, mime_map is wrong, or libmagic screws up fingerprinting the file, we extract empty keyword set because the wrong libextractor plugin is called. But that is already the case with omindex because it currently depends on the file extension being correct.

I don't recall anyone complaining before about basing the filetype detection purely on filename extension, so I think it works well enough in most situations. If you're indexing web server content, most webservers set the mimetype from the filename extension by default anyway!

So feel free to patch away!

Incidentally, is it possible to open the file once and pass the file handle to libmagic, libextractor, etc?

Changed 22 months ago by nemesis

> Incidentally, is it possible to open the file once and pass the file handle to libmagic, libextractor, etc?

Yes, the getKeywords2() function takes a buffer as an argument, so just mmap the file and pass in the buffer and length. The same works with magic_buffer().

Changed 3 months ago by olly

  • description modified
  • milestone set to 1.1.0

Did you ever get a chance to code this up?

I'm looking at what we want to try to get into Xapian 1.1.0, and this is a candidate, especially if there's already a working patch!

Changed 3 weeks ago by olly

  • milestone changed from 1.1.0 to 1.1.1

The current patch isn't ready to apply and this change could be made in 1.1.x, so bumping milestone.
