Opened 13 years ago

Closed 4 years ago

#517 closed enhancement (fixed)

omindex libextractor module

Reported by: Olly Betts
Owned by: Olly Betts
Priority: lowest
Milestone: 1.5.0
Component: Omega
Version: git master
Severity: minor
Keywords: GoodFirstBug
Cc: Ryan Underwood
Blocked By:
Blocking:
Operating System: All

Description

This is "son of #114" - that ticket was about using libmagic and libextractor, which is really two issues. The libmagic part is now done, but the libextractor part remains. This would be a potentially disruptive change, which I don't think is appropriate to make mid-1.2 series, so I'm marking this as milestone:1.3.0.


patch to use libmagic and libextractor

This is a horrible hack, but you get the idea. A better setup would not bother with fileext/mimetypes that are known already to have no extractors available.


Summarising the relevant parts of #114:

libextractor plus points:

  • Has plugins for many file types
  • Extracts metadata as well as text
  • Saves us having to maintain code to perform filtering for so many formats

Issues:

  • Haven't compared output quality with existing filters
  • Current libextractor API (at least when #114 was filed) doesn't distinguish between not having a plugin for a format, and the format not having metadata to extract, which makes it hard to efficiently fall back to other filters.

We could use libextractor as a toolbox of filters which we pick from ourselves:

I see one option besides taking this speed hit. (I believe forcing that hit upon the user would be contrary to the design of omindex, since avoiding it was the whole point of removing file extensions that are not handled by index_file from the map.)

This would be to map MIME types directly to libextractor plugins.

The maintainer guarantees that the names of libextractor plugins are static. So we keep the filename-to-MIME-type map, which saves opening the file if the user doesn't want to use libmagic (libmagic gives more accurate MIME type identification).

Then we add a MIME-type-to-libextractor-plugin map, so that we check the MIME type of a file passed to index_file, and call libextractor with an ExtractorList only including the plugin for that one file's MIME type.

Drawbacks:

  • Requires a priori knowledge of which plugins libextractor currently has in order to add new ones, but that shouldn't change very frequently. Essentially libextractor is a Swiss Army knife text extractor, and adding a new format it supports is conceptually similar to adding support for a new filter program or library (but less work!)

  • If the file extension is wrong, mime_map is wrong, or libmagic screws up fingerprinting the file, we extract an empty keyword set because the wrong libextractor plugin is called. But that is already the case with omindex, because it currently depends on the file extension being correct, and webservers typically pick the content-type to serve based on the extension anyway.
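
For illustration, the two-stage lookup proposed above (filename to MIME type, then MIME type to a single libextractor plugin) could look roughly like the following. This is only a sketch, not code from omindex or from the attached patch: the table contents, the plugin names and the functions mime_type_for() and plugin_for() are all hypothetical, and it assumes a C++17 compiler for std::optional.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for omindex's extension -> MIME type map.
static const std::map<std::string, std::string> ext_to_mime = {
    {"pdf", "application/pdf"},
    {"ogg", "audio/ogg"},
    {"txt", "text/plain"},
};

// Hypothetical MIME type -> libextractor plugin name map.
// The plugin names here are illustrative, not checked against libextractor.
static const std::map<std::string, std::string> mime_to_plugin = {
    {"application/pdf", "pdf"},
    {"audio/ogg", "ogg"},
};

// Guess the MIME type from the extension alone, without opening the file.
std::optional<std::string> mime_type_for(const std::string& filename) {
    auto dot = filename.rfind('.');
    if (dot == std::string::npos) return std::nullopt;
    auto it = ext_to_mime.find(filename.substr(dot + 1));
    if (it == ext_to_mime.end()) return std::nullopt;
    return it->second;
}

// Pick the one plugin to load for this file, or nothing if no plugin
// is known for its MIME type (the caller can then skip libextractor).
std::optional<std::string> plugin_for(const std::string& filename) {
    auto mime = mime_type_for(filename);
    if (!mime) return std::nullopt;
    auto it = mime_to_plugin.find(*mime);
    if (it == mime_to_plugin.end()) return std::nullopt;
    return it->second;
}
```

With these tables, plugin_for("report.pdf") yields "pdf", while plugin_for("notes.txt") yields nothing because text/plain has no entry in the second map, so libextractor would never be invoked for that file. A wrong extension selects the wrong plugin, which is exactly the second drawback listed above.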

Change History (5)

comment:1 by Olly Betts, 12 years ago

Milestone: 1.3.0 → 1.3.x

comment:2 by Olly Betts, 9 years ago

Milestone: 1.3.x → 1.4.x

I'm not entirely sold on libextractor as being a great fit for omindex's needs, but if someone wants to experiment with this I'm happy to review results and help get changes merged if they merit it.

This isn't something to hold up 1.4.0 for though.

comment:3 by Olly Betts, 8 years ago

A couple more issues:

  • libextractor doesn't seem to be very actively maintained - last release was just over 2 years ago, and there's only one mailing list thread since then. That's not particularly appealing for a proposed new dependency.
  • At least in the current Debian packaging, libextractor3 seems to unconditionally drag in a lot of libraries. I'm not wild about the idea that installing omega would force you to install ffmpeg/libav, gstreamer, gtk3, pango, etc. These might well be installed on a typical desktop system already, but probably won't be in a server environment, which is where omega is typically used.

comment:4 by Olly Betts, 5 years ago

Keywords: GoodFirstBug added
Milestone: 1.4.x
Priority: normal → lowest
Severity: normal → minor
Summary: omindex: could use libextractor for many formats → omindex libextractor module
Version: git master

There's now support on git master for worker modules for extraction libraries, so the best way to support this now would be via that mechanism. This somewhat isolates the main indexer process from bugs in libextractor, and also allows binary packages to be split so only people who want to use libextractor have to install it and all its dependencies.

I think doing that would be suitable for somebody new to the codebase and wanting to get to grips with it.

comment:5 by Olly Betts, 4 years ago

Milestone: 1.5.0
Resolution: fixed
Status: new → closed

Fixed by b1a72b53a8c96446918261e954af1e3fcbf14720 on git master.

This uses the worker module machinery that's only on git master, so not suitable for backporting.
