Opened 18 years ago
Closed 14 years ago
#114 closed enhancement (fixed)
Use libmagic or libextractor instead of own MIME mappings and extractions
Reported by: | Ryan Underwood | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.4 |
Component: | Omega | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
Hello,
I locally first modified omindex to use libmagic's MIME database, instead of hard coding the MIME type to file extension mapping. This ensures that the internally used MIME types are more consistent with accepted standard types.
Then I went further and instead of using file extensions to determine type, used libmagic to fingerprint the files. This is slower, but ensures that the file actually is identified correctly even if the extension is wrong.
Now I am using libextractor to actually extract the metadata from the file, instead of calling these external programs inside omindex based on the MIME type. Using libextractor greatly simplifies omindex.
Is anyone interested in these modifications?
Attachments (1)
Change History (16)
comment:1 by , 18 years ago
comment:2 by , 18 years ago
Using libextractor makes libmagic unnecessary: each libextractor plugin inspects the file content itself and only claims files it recognises. There is currently one problem with using libextractor: the API gives no way to distinguish a file that was ignored from one that yielded an empty keyword set. So omindex would be inefficient, repeatedly trying to extract from file types for which no extractor plugin exists. I have emailed the libextractor maintainers about that issue. In the meantime, it is perhaps a good idea to phase in optional libextractor support. I will see about writing the appropriate autoconf checks to detect its presence and use it if the user enabled it.
comment:3 by , 18 years ago
This is a horrible hack, but you get the idea. A better setup would not bother with file extensions or MIME types that are already known to have no extractors available.
comment:4 by , 18 years ago
Oh, you also need to check for -lmagic and -lextractor in the configure script. I hard coded them into the makefile for now.
comment:5 by , 18 years ago
Operating System: | → All |
---|---|
Priority: | highest → normal |
rep_platform: | PC → All |
Status: | new → assigned |
Doesn't look too bad!
It would be better to avoid stringstream, which isn't available on some compilers we try to support. In this case it can just be replaced by appending directly to "dump", which is probably clearer and at least as efficient anyway.
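As a rough illustration of that suggestion, here is a minimal sketch of building a record by appending straight to a std::string rather than streaming into a stringstream. The function name `make_record` and the field names are made up for the example and are not taken from the patch or from omindex. It also uses `std::to_string`, which assumes a C++11 compiler.

```cpp
#include <string>

// Hypothetical helper (not from the patch): builds a newline-separated
// "key=value" record by appending directly to a std::string.
std::string make_record(const std::string& url,
                        const std::string& mimetype,
                        unsigned long size) {
    std::string dump;
    dump += "url=";
    dump += url;
    dump += "\ntype=";
    dump += mimetype;
    dump += "\nsize=";
    dump += std::to_string(size);  // C++11; older compilers need another conversion
    return dump;
}
```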
This is quite a major structural change, so I think this is probably something to look at merging after 1.0 is out. My aim is to have that released at some point in April, so I'm trying to avoid changes with the potential to destabilise at the moment.
Does libmagic detect the character set of text files usefully? I played with GNU file and that is able to at least say "ASCII", "ISO-8859" or "UTF-8", which certainly beats assuming that all text files are UTF-8. If so, that part would certainly be useful for 1.0.
comment:6 by , 18 years ago
Yes, it can detect the charset. That's why I use strncmp() on the MIME string: libmagic outputs something like text/plain; charset=us-ascii, which would befuddle an exact comparison.
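To make the prefix comparison concrete, here is a small sketch that assumes the detector returns strings of the form `type/subtype; charset=name`. The helper names `mime_is` and `mime_charset` are invented for this example and do not come from the patch.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: true if `mime` names `type`, ignoring any
// parameters such as "; charset=us-ascii" that follow the type.
bool mime_is(const char* mime, const char* type) {
    std::size_t n = std::strlen(type);
    if (std::strncmp(mime, type, n) != 0) return false;
    // The prefix matched, so mime[n] is a valid index (possibly the NUL).
    // Require a boundary so "text/plain" does not match "text/plainish".
    char next = mime[n];
    return next == '\0' || next == ';' || next == ' ';
}

// Hypothetical helper: returns the value of the charset parameter,
// or an empty string if there is none.
std::string mime_charset(const std::string& mime) {
    const std::string key = "charset=";
    std::string::size_type pos = mime.find(key);
    if (pos == std::string::npos) return std::string();
    pos += key.size();
    std::string::size_type end = mime.find_first_of("; ", pos);
    return mime.substr(pos, end == std::string::npos ? end : end - pos);
}
```

Keeping the charset rather than discarding it would also let the indexer pick the right encoding for text files.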
Waiting for 1.0 will be fine, since it's unclear if libextractor upstream will move quickly on my request for the API change.
comment:7 by , 18 years ago
I also noticed that I moved the md5 calculation block to the wrong place; it should be performed unconditionally.
comment:8 by , 18 years ago
OK, I had a discussion with the libextractor maintainer. His objection to my idea is that a file extension or MIME type alone cannot tell you whether an extractor plugin will be able to handle a file (consider different versions of a file specification like PDF), so he doesn't want to change the API to do something that he feels is stupid. He did point out that libextractor only opens and mmaps the file before going through the plugins, so I/O is minimal. But since there is no way to tell whether libextractor will be able to handle the particular file we feed to it, we would always waste the open.
I see one option besides taking this speed hit. (I believe forcing that hit on the user would be contrary to the design of omindex, since avoiding it was the whole point of removing from the map the file extensions that index_file does not handle.)
The option is to map MIME types directly to libextractor plugins.
The maintainer guarantees that the names of libextractor plugins are static. So we keep the filename-to-MIME-type map, to save the open if the user doesn't want to use libmagic (libmagic being the option for more accurate MIME type identification).
Then we add a MIME-type-to-libextractor-plugin map: we check the MIME type of a file passed to index_file, and call libextractor with an ExtractorList containing only the plugin for that file's MIME type.
Drawbacks:
- Requires a priori knowledge of what plugins libextractor currently has in order to add new ones, but it shouldn't change that frequently.
- If the file extension is wrong, mime_map is wrong, or libmagic screws up fingerprinting the file, we extract an empty keyword set because the wrong libextractor plugin is called. But that is already the case with omindex because it currently depends on the file extension being correct.
So, I think it will work. If you think this approach works, I'll code it.
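The two-level lookup proposed above could be sketched roughly as follows. This is only an illustration: the table entries are examples, the plugin names are placeholders rather than a verified list of libextractor plugins, and `plugin_for_file` is a made-up name. Building the actual ExtractorList from the returned plugin name is left out.

```cpp
#include <map>
#include <string>

// Illustrative entries only; a real table would cover every type omindex handles.
static const std::map<std::string, std::string> ext_to_mime = {
    {"pdf", "application/pdf"},
    {"html", "text/html"},
    {"mp3", "audio/mpeg"},
};

// Placeholder plugin names (assumption): check them against the
// installed libextractor before relying on them.
static const std::map<std::string, std::string> mime_to_plugin = {
    {"application/pdf", "libextractor_pdf"},
    {"text/html", "libextractor_html"},
    {"audio/mpeg", "libextractor_mp3"},
};

// Returns the plugin to load for `filename`, or "" if the extension or
// the MIME type is unmapped, in which case the file is never opened.
std::string plugin_for_file(const std::string& filename) {
    std::string::size_type dot = filename.rfind('.');
    // No extension, or the last dot belongs to a directory component.
    if (dot == std::string::npos ||
        filename.find('/', dot) != std::string::npos)
        return std::string();
    auto mime = ext_to_mime.find(filename.substr(dot + 1));
    if (mime == ext_to_mime.end()) return std::string();
    auto plugin = mime_to_plugin.find(mime->second);
    if (plugin == mime_to_plugin.end()) return std::string();
    return plugin->second;
}
```

An empty result is the cheap path: neither map matched, so libextractor is not called at all.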
comment:9 by , 18 years ago
It seems it would still be useful to know if libextractor failed to extract keywords, or if the file simply has no meta-data, since in the former case you might want to try running other "convert to text" libraries or utilities, whereas in the latter case that would probably be a waste of time.
But I do see where the libextractor maintainer is coming from.
Drawbacks:
- Requires a priori knowledge of what plugins libextractor currently has in order to add new ones, but it shouldn't change that frequently.
Yeah, I don't see that as a huge issue. Essentially libextractor is a Swiss Army knife text extractor, and adding a new format it supports is conceptually similar to adding support for a new filter program or library (but less work!).
- If the file extension is wrong, mime_map is wrong, or libmagic screws up fingerprinting the file, we extract an empty keyword set because the wrong libextractor plugin is called. But that is already the case with omindex because it currently depends on the file extension being correct.
I don't recall anyone complaining before about basing the filetype detection purely on filename extension, so I think it works well enough in most situations. If you're indexing web server content, most webservers set the mimetype from the filename extension by default anyway!
So feel free to patch away!
Incidentally, is it possible to open the file once and pass the file handle to libmagic, libextractor, etc?
comment:10 by , 18 years ago
Incidentally, is it possible to open the file once and pass the file handle to libmagic, libextractor, etc?
Yes, the getKeywords2() function takes a buffer as an argument, so just mmap the file and pass in the buffer and length, same with magic_buffer().
comment:12 by , 16 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.1.0 |
Did you ever get a chance to code this up?
I'm looking at what we want to try to get into Xapian 1.1.0, and this is a candidate, especially if there's already a working patch!
comment:13 by , 16 years ago
Milestone: | 1.1.0 → 1.1.1 |
---|
The current patch isn't ready to apply and this change could be made in 1.1.x, so bumping milestone.
comment:16 by , 14 years ago
Milestone: | 1.2.x → 1.2.4 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
SVN trunk now uses libmagic to get a MIME content-type if it doesn't have a mapping for the extension. So the user can choose to use extensions to determine MIME content-type (as before), or use libmagic to look at the files (more accurate, but a little slower), or assume some extensions are correct but use magic for others.
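The resulting policy can be summarised in a short sketch. This is not the code that went into trunk: the detector is a caller-supplied callback standing in for libmagic, and `content_type` and `MagicDetector` are names invented here.

```cpp
#include <functional>
#include <map>
#include <string>

// Stand-in for a libmagic lookup: takes a path, returns a MIME type
// (or "" if detection fails).
using MagicDetector = std::function<std::string(const std::string&)>;

// Uses the extension mapping when one exists; otherwise falls back to
// content-based detection. Returns "" if neither yields a type.
std::string content_type(const std::string& path,
                         const std::map<std::string, std::string>& ext_map,
                         const MagicDetector& detect) {
    std::string::size_type dot = path.rfind('.');
    if (dot != std::string::npos &&
        path.find('/', dot) == std::string::npos) {
        auto it = ext_map.find(path.substr(dot + 1));
        if (it != ext_map.end()) return it->second;  // file is never opened
    }
    return detect ? detect(path) : std::string();
}
```

A full map reproduces the old extension-only behaviour, an empty map sends every file to the detector, and a partial map trusts some extensions while checking the rest.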
I've split the libextractor side of this ticket off as #517 with a summary of the relevant discussion from here, and I'm closing this ticket.
Yes, these sound interesting. I'd actually had a quick look at libextractor already and bookmarked it as "interesting", and I've also been wondering about using something like libmagic to decide what the encoding of text files is...
I think testing with libmagic probably should be optional. It seems a useful feature, but in many situations I'd certainly be happy to assume that the file extensions are correct in exchange for faster indexing.