Opened 12 years ago

Last modified 8 months ago

#583 new enhancement

Spin off Omega's filetype conversion code as a library

Reported by: Olly Betts | Owned by: Olly Betts
Priority: low | Milestone:
Component: Omega | Version:
Severity: normal | Keywords:
Cc: Kelson | Blocked By:
Blocking: | Operating System: All

Description

Creating a ticket for this so it doesn't get forgotten:

http://thread.gmane.org/gmane.comp.search.xapian.general/9189

This is potentially 1.3.x material, so setting milestone appropriately.

Change History (8)

comment:1 by Olly Betts, 8 years ago

Milestone: 1.3.x → 1.4.x

Not a blocker for 1.4.0.

comment:2 by Olly Betts, 4 years ago

Milestone: 1.4.x → (none)
Priority: normal → low

gmane's web UI seems to have died a death, so I pulled up the message via NNTP and found it in our own archive: https://lists.xapian.org/pipermail/xapian-discuss/2011-November/008562.html

Adjusting the metadata: Bruno's extraction module support is relevant here, and it has been merged to git master but isn't on RELEASE/1.4. Also, 1.4.x is really at the stage where we don't want potentially disruptive changes.

comment:3 by Kelson, 4 years ago

Cc: Kelson added

At openzim/Kiwix, a decade ago, we simply copied a bit of Omindex's code to index our HTML documents. At the time this was quite straightforward, but now we would like to do things more properly, i.e. have the Xapian document parsing features of Omindex available as a library. See https://github.com/openzim/libzim/issues/377 for more details.

The problem is not super urgent, but it is important, and we are willing to help by:

  • Emphasizing the importance of this for us... and in general (surprisingly, I don't see many comments here)
  • Being an integration sparring partner at the early stage
  • Running compilation tests on many OSes
  • Possibly helping with the Deb packaging as well

What is the current status on this really old feature request?

comment:4 by Olly Betts, 4 years ago

This isn't really something that's being actively worked on I'm afraid.

One key problem is that making something a public API locks down the design significantly - to actually be useful a public API needs to come with a commitment to stability, whereas a private API inside our code we can change pretty much at will (the only real concern is any patches in progress which touch the same code).

Another issue is that design decisions that make total sense in a narrower context are problematic for other potential uses.

Just hacking out code from inside omindex into a library and advertising it as a public API isn't really enough - this requires quite a lot of work to do properly.

If you look at the design in the old email message, it requires passing a filename. That's a reasonable requirement if you always have a local file you want to process, but requires a temporary file to be created in other situations (like spidering websites to index or indexing files extracted from compound file formats like ISOs, tarballs, ZIP archives, attachments from emails, etc).

And indeed in the meantime omindex's code has evolved to allow extracting files from a file descriptor. If we were working with a public API, that would have been much harder to do, because we'd have to have maintained compatibility with the existing API. Or else we'd have had to make an incompatible major version bump and forced all users of the API to rewrite their code. There are libraries that do that - I've used a few and they aren't fun to be a user of.
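To illustrate why the choice of input type matters for a public API, here is a minimal sketch, not omindex's actual code. The names `read_all_from_fd` and `read_all_from_file` are hypothetical. It shows that a filename entry point is trivial to layer on top of a descriptor-based one, whereas the reverse requires a temporary file.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical: read everything from an already-open descriptor.
// Works the same for regular files, pipes and sockets.
bool read_all_from_fd(int fd, std::string& out) {
    assert(fd >= 0);
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            out.append(buf, static_cast<size_t>(n));
            continue;
        }
        if (n == 0) return true;        // end of input
        if (errno == EINTR) continue;   // interrupted, retry
        return false;                   // genuine read error
    }
}

// Hypothetical convenience wrapper: the filename case is just
// "open, then delegate to the descriptor version".
bool read_all_from_file(const char* path, std::string& out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    bool ok = read_all_from_fd(fd, out);
    close(fd);
    return ok;
}
```

A caller with data arriving on a pipe or socket can use the descriptor version directly, without first writing the data out to disk.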

If we're going to add a public API for this (or anything else really) I think we need to do it well. Doing it badly doesn't actually help users, but still takes developer energy away from other areas.

comment:5 by Kelson, 4 years ago

Olly, thank you for your feedback. I completely agree with you. In our use case, for example, we would need to pass strings; nothing to do with the filesystem at all.

Considering that the ticket is still open, I assume this is a path you are still wanting to follow, isn't it?

If you or any other senior dev of Xapian would have to do it (and assuming you could/would), how many days of work would you estimate to be able to release a first version?

Assuming this first version would have been released, could I assume this new library will be maintained by the core Xapian team?

comment:6 by Olly Betts, 8 months ago

Oddly I remember responding to this, but perhaps that was an email thread on the same topic. Anyway, summarising the current situation here:

> Considering that the ticket is still open, I assume this is a path you are still wanting to follow, isn't it?

It's something I'm generally supportive of doing if we can do it well.

On git master, we have made significant steps towards being able to use the extraction code outside of omindex. Extractors for formats which are available as a library API are now effectively plugins (separate binaries which omindex runs in subprocesses, communicating via pipes), and at least conceptually we could make a public API for the small amount of code that gets linked into omindex to communicate with the plugins. In practice some things still really need sorting out first though.
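As a rough illustration of the "subprocess communicating via pipes" arrangement, the sketch below runs a worker program, sends it a request on its stdin and collects its stdout. This is not the protocol omindex actually uses. The helper name `run_worker` and the idea of a single request followed by EOF are assumptions made for the example.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run argv as a worker subprocess, send `request` to
// its stdin and collect its stdout in `reply`.
// Only suitable for small requests: a large one could deadlock once both
// pipes fill, so real code would interleave reads and writes (e.g. poll())
// and would also need to deal with SIGPIPE if the worker dies early.
bool run_worker(char* const argv[], const std::string& request,
                std::string& reply) {
    assert(argv && argv[0]);
    int to_child[2], from_child[2];
    if (pipe(to_child) < 0) return false;
    if (pipe(from_child) < 0) {
        close(to_child[0]); close(to_child[1]);
        return false;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        return false;
    }
    if (pid == 0) {
        // Child: wire the pipes to stdin/stdout, then become the worker.
        dup2(to_child[0], STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        execvp(argv[0], argv);
        _exit(127);  // only reached if exec failed
    }
    // Parent: keep only our ends of the two pipes.
    close(to_child[0]);
    close(from_child[1]);

    size_t off = 0;
    while (off < request.size()) {
        ssize_t n = write(to_child[1], request.data() + off,
                          request.size() - off);
        if (n < 0) {
            if (errno == EINTR) continue;
            break;
        }
        off += static_cast<size_t>(n);
    }
    close(to_child[1]);  // EOF tells the worker there is no more input

    char buf[4096];
    for (;;) {
        ssize_t n = read(from_child[0], buf, sizeof buf);
        if (n > 0) {
            reply.append(buf, static_cast<size_t>(n));
            continue;
        }
        if (n < 0 && errno == EINTR) continue;
        break;  // EOF or error
    }
    close(from_child[0]);

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return false;
    }
    return off == request.size() && WIFEXITED(status) &&
           WEXITSTATUS(status) == 0;
}
```

Because the worker is a separate process, a crash inside it shows up in the parent as a non-zero wait status rather than taking the parent down with it.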

Formats which are extracted via an external program (e.g. catdvi for DVI files) currently get run from the main omindex process. I think a "run an external command" plugin would make sense even within omindex since (at least on Linux) fork() can get unreasonably slow for a process with a large memory footprint due to the cost of copying the page mappings - moving the fork() to a plugin process should avoid this issue. It'd also be desirable to support this in a public API around these plugins, though it could reasonably be added later so long as we're confident it can be implemented without incompatible changes to that API.

Formats which are extracted entirely by code in the xapian repo are currently handled entirely in process. This includes HTML, SVG, CSV, Atom feeds. There doesn't seem to be a compelling reason for moving these to plugins as far as omindex is concerned, but they could be provided as a plugin or plugins for external use. Another option would be to provide a direct public API for the HTML parser so it could be used in other programs much like it is in omindex.

There's also PostScript for which we have hardcoded handling in-process (we convert via PDF by running ps2pdf then pdftotext as the direct convertors don't handle Unicode - I just checked and man pstotext still says "pstotext always translates to the ISO 8859-1 (Latin-1) character code"). There's a poppler plugin so probably we should move this support to that (it could probably also use libgs instead of running ps2pdf).

Currently input is provided by passing a filename to the plugin. That's mostly OK for omindex, though the in-process handling supports extracting from a file descriptor. You can pass an fd across a socket on Unix, so that could be supported. For your use it sounds like you'd like to be able to pass input in a buffer, which isn't currently supported but we could probably support that efficiently via a shared mmap() buffer or similar. This would also be useful for being able to chain plugins (e.g. extracting text from a file inside a Zip archive).
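For reference, passing a file descriptor across a Unix-domain socket is done with an `SCM_RIGHTS` control message. The following is a minimal sketch of that mechanism only. The function names `send_fd` and `recv_fd` are made up for the example, and nothing here reflects how the plugin protocol would actually frame such a message.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper: send descriptor `fd` over the connected
// Unix-domain socket `sock`. One dummy data byte carries the message.
bool send_fd(int sock, int fd) {
    assert(sock >= 0 && fd >= 0);
    char byte = 0;
    iovec iov{};
    iov.iov_base = &byte;
    iov.iov_len = 1;

    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))];
    std::memset(control, 0, sizeof control);

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof control;

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1;
}

// Hypothetical helper: receive one descriptor sent by send_fd().
// Returns the new descriptor (owned by the caller) or -1 on failure.
int recv_fd(int sock) {
    assert(sock >= 0);
    char byte = 0;
    iovec iov{};
    iov.iov_base = &byte;
    iov.iov_len = 1;

    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))];
    std::memset(control, 0, sizeof control);

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof control;

    if (recvmsg(sock, &msg, 0) != 1) return -1;

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets its own descriptor number referring to the same open file, so the sender can close its copy once the message has been sent.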

Ideally we'd have a testsuite for the new API, but we do at least have testing of omindex on git master which provides indirect testing of the plugins, and could probably be morphed into a testsuite for the API.

The other thing that's missing, and which may matter for some use cases, is sandboxing of the plugins. If you're indexing data that may be actively hostile, that raises concerns beyond those of the problem space omindex is aimed at. Having the extraction in a subprocess means it can't crash the main omindex process, but that's aimed more at avoiding problems from bugs in the extraction libraries being inadvertently triggered. E.g. if you were using this extraction code to handle attachments in a mail reader you'd really need robust sandboxing. Sadly, modern sandboxing features tend to be platform-specific, so we'll probably need to let people contribute sandboxing implementations for the platforms they care about.

> If you or any other senior dev of Xapian would have to do it (and assuming you could/would), how many days of work would you estimate to be able to release a first version?

It would depend a lot on which parts are hard requirements, but it's probably somewhere from a few days to a few weeks.

I did have a client funding work on this but they had a major technical restructuring a few months ago and I don't know if they're likely to continue. If you or someone else reading has a budget I'm happy to discuss.

> Assuming this first version would have been released, could I assume this new library will be maintained by the core Xapian team?

Yeah, I wouldn't want to release something we weren't intending to maintain.

comment:7 by Eric Wong, 8 months ago

posix_spawn can use vfork on glibc and musl to avoid the overhead of forks which are followed by execve.
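For readers unfamiliar with the call, a minimal sketch of launching a program with `posix_spawnp()` and waiting for it might look like the following. The wrapper name `spawn_and_wait` is hypothetical. No file actions or spawn attributes are used here.

```cpp
#include <cassert>
#include <cerrno>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern "C" char** environ;

// Hypothetical wrapper: run argv[0] (looked up in PATH) with the current
// environment and return its exit status, or -1 on failure.
int spawn_and_wait(char* const argv[]) {
    assert(argv && argv[0]);
    pid_t pid;
    // posix_spawnp() returns an errno value directly rather than setting errno.
    int err = posix_spawnp(&pid, argv[0], nullptr, nullptr, argv, environ);
    if (err != 0) return -1;

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

There is no point between process creation and the exec at which arbitrary caller code runs. Anything to be done in the child has to be expressed through the file actions and attributes arguments.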

comment:8 by Olly Betts, 8 months ago

> posix_spawn can use vfork on glibc and musl to avoid the overhead of forks which are followed by execve.

It's limited in what can be done in the new process before the exec() though - in particular we currently set resource limits there (to reduce the impact a runaway filter can have) which posix_spawn() doesn't appear to support.
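The pattern being described, setting limits in the child between `fork()` and `exec()`, can be sketched as below. This is an illustration rather than omindex's code. The function name `run_with_cpu_limit` is hypothetical, and a CPU-time limit is used purely as an example of a resource limit.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run argv with a CPU-time limit that applies only to
// the child. Returns the child's exit status, or -1 on failure.
int run_with_cpu_limit(char* const argv[], rlim_t cpu_seconds) {
    assert(argv && argv[0]);
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: this window between fork() and exec() is where the limit
        // is applied, so the parent's own limits are left untouched.
        struct rlimit lim;
        lim.rlim_cur = cpu_seconds;
        lim.rlim_max = cpu_seconds;
        if (setrlimit(RLIMIT_CPU, &lim) < 0) _exit(126);
        execvp(argv[0], argv);
        _exit(127);  // only reached if exec failed
    }
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```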

Setting resource limits in the main omindex process before calling posix_spawn() which are then inherited by the new process seems liable to cause problems (e.g. if omindex is already using more memory than we want to limit the new child process to). Maybe we could use the Linux-specific prlimit() to set them from the parent after posix_spawn(), but that limits the new code to Linux.
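A sketch of the `prlimit()` variant mentioned above follows. It is Linux-only. The helper name `spawn_then_limit` is hypothetical, and a CPU-time limit again stands in for whichever limits are wanted. Note that the child may already be running by the time the limit is applied, so there is a short window in which it is unconstrained.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for prlimit() on glibc
#endif
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <spawn.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>

extern "C" char** environ;

// Hypothetical helper (Linux-specific): start argv with posix_spawnp(), then
// set its CPU-time limit from the parent. Returns the child's pid, or -1.
pid_t spawn_then_limit(char* const argv[], rlim_t cpu_seconds) {
    assert(argv && argv[0]);
    pid_t pid;
    if (posix_spawnp(&pid, argv[0], nullptr, nullptr, argv, environ) != 0) {
        return -1;
    }
    struct rlimit lim;
    lim.rlim_cur = cpu_seconds;
    lim.rlim_max = cpu_seconds;
    if (prlimit(pid, RLIMIT_CPU, &lim, nullptr) < 0) {
        // Could not apply the limit: don't leave an unlimited child running.
        kill(pid, SIGKILL);
        int status = 0;
        while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
        return -1;
    }
    return pid;  // caller is responsible for waitpid()
}
```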

I think it probably makes more sense to put the effort into moving the launching of filter programs into a plugin instead. That plugin could probably set the resource limits in its own process and then call posix_spawn(), though the plugin process should have a small memory footprint, so the slow fork() case shouldn't be something we run into anyway.

Last edited 8 months ago by Olly Betts