Opened 16 years ago

Closed 14 years ago

#324 closed enhancement (fixed)

A Script that uses OpenOffice to filter text for Xapian Omega

Reported by: Frank J Bruzzaniti
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.6
Component: Omega
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: Linux

Description (last modified by Olly Betts)

This Python script is an example of how to use OpenOffice to convert documents to text. It starts a headless instance of OpenOffice, which should remain running, and attempts to start a new instance if one is not already running. It also uses unoconv, which can be downloaded from http://dag.wieers.com/home-made/unoconv/.

Unoconv doesn't need to be told what format it is accepting, so you should be able to slot the script anywhere in omindex without too much hassle. For example, I replaced antiword in omindex.cc with oOC.py (this script) because antiword couldn't open .doc files saved via Word Perfect.

I would love to get some high-end stability and performance testing using OpenOffice as a filter. I couldn't figure out how to get Python to correctly marshal the soffice process, hence I parsed the output of the ps command. Maybe one of the Python gurus could have a look :)

Attachments (3)

oOf.py (1.5 KB ) - added by Frank J Bruzzaniti 16 years ago.
OpenOffice Filter Script
SaveAsHTML.macro (1002 bytes ) - added by Frank J Bruzzaniti 16 years ago.
Open Office Macro to convert to HTML
SaveAsHTML.sh (579 bytes ) - added by Frank J Bruzzaniti 16 years ago.
Bash script to run open office macro


Change History (11)

by Frank J Bruzzaniti, 16 years ago

Attachment: oOf.py added

OpenOffice Filter Script

comment:1 by Olly Betts, 16 years ago

Component: Examples → Omega
Description: modified (diff)

Sadly antiword isn't getting many updates now, so the option to use something more actively maintained would be useful. Perhaps openoffice is a bit heavyweight, but the ability to use a single instance in the background should at least mean the runtime overhead isn't an issue.

As you suggest, the "ps" stuff really needs replacing with something better (amongst other things, "ps -ef" isn't portable, and you're hard-coding the install location).

This script also assumes nothing is running on port 2002, and allows other users on the system to do things with your openoffice process, which is a potential security risk.

I had a quick look at unoconv and it looks like you'd do better to (but I've not tested either):

  • use unoconv --listener to start a persistent openoffice process
  • use unoconv --pipe to use a named pipe to communicate with openoffice instead of a TCP socket
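One possible replacement for parsing "ps" output, which also catches the case where port 2002 is already taken, is to probe the port directly. This is only a minimal Python sketch, assuming the listener accepts TCP connections on 127.0.0.1:2002 as in the original script; port_in_use is a hypothetical helper name, not part of unoconv or OpenOffice. It only reports whether something accepts connections on that port, not whether that something is OpenOffice.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 2002,
                timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: the host and port defaults mirror the values
    assumed by the original script, not anything unoconv mandates.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("Something is already listening on 127.0.0.1:2002")
    else:
        print("Port 2002 looks free; a listener could be started here")
```

This avoids the non-portable "ps -ef" call and the hard-coded install location, but it does nothing about the access-control concern, which is what the --pipe suggestion is aimed at.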

comment:2 by Frank J Bruzzaniti, 16 years ago

Keywords: open office convert added

Found a macro and created a bash script that seems to work ok as a proof of concept. It doesn't require openoffice to be running in the background like my previous script.

At this stage the macro converts pdf, ppt, pps, doc and xls files to HTML, as OpenOffice supports HTML export for almost everything while only certain documents can be exported to text.

I guess we can use Xapian's method for parsing HTML in conjunction with this, although what's in the bash script would probably be better converted to C (which I'm not any good at).

I've commented the bash script so it should be easy enough to follow.

by Frank J Bruzzaniti, 16 years ago

Attachment: SaveAsHTML.macro added

Open Office Macro to convert to HTML

by Frank J Bruzzaniti, 16 years ago

Attachment: SaveAsHTML.sh added

Bash script to run open office macro

comment:3 by Frank J Bruzzaniti, 16 years ago

Keywords: openoffice added; open office removed

Altered the bash script so it does cat *.html to show output. Removed deletion of the files created.

It's just a proof of concept at the moment, I'm hoping someone will be able to help to verify that it's a good idea and help me integrate it into omindex.

Hopefully it will serve as a replacement for unmaintained packages like antiword.

comment:4 by Olly Betts, 16 years ago

Keywords: openoffice convert removed
Summary: A Script that users OpenOffice to filter text for Xapian Omega → A Script that uses OpenOffice to filter text for Xapian Omega

I can see some people might prefer to use openoffice for such things. I don't think it's a suitable ubiquitous replacement for antiword - openoffice is just too heavyweight as a default. I think the best candidate for a replacement for antiword as the default is wvWare.

This script is problematic in a few ways though. E.g. just killing off any existing openoffice processes is very hostile - what if I'm writing a report and cron kicks off an index update on the same machine? Also, it doesn't appear to clean up temporary files, and since it does cat *.html that seems to mean it will produce the output from every previous file processed!

Also, openoffice is rather slow to start up, so I think we really would want to use a persistent instance.

comment:5 by Frank J Bruzzaniti, 16 years ago

I agree a persistent instance would be useful, but I've had issues with it being buggy. I've seen posts from a few people who have claimed the same, but I suspect the problem could be unoconv, since I know the Alfresco project uses OpenOffice in this way. Maybe it's worth looking into.

As for my script, it's not complete but it's a start. I guess I could dump the files in a "temporary" directory, do the cat *.html then delete the directory.
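The temporary-directory idea could look roughly like the Python sketch below. It is not the attached bash script: collect_html and the convert callback are hypothetical names, and the callback stands in for whatever actually runs the OpenOffice macro and writes the .html files.

```python
import tempfile
from pathlib import Path
from typing import Callable


def collect_html(source: Path, convert: Callable[[Path, Path], None]) -> str:
    """Convert one document in a private directory and return its HTML.

    `convert(source, out_dir)` is a hypothetical callback that must write
    one or more .html files into out_dir (e.g. by running the macro).
    """
    with tempfile.TemporaryDirectory(prefix="omindex-oo-") as tmp:
        out_dir = Path(tmp)
        convert(source, out_dir)
        # Only files produced for this document are in out_dir, so the
        # equivalent of "cat *.html" cannot pick up earlier output.
        parts = [
            page.read_text(encoding="utf-8", errors="replace")
            for page in sorted(out_dir.glob("*.html"))
        ]
    # The directory and everything in it is deleted when the block exits.
    return "".join(parts)
```

Using a fresh directory per document addresses both points raised in comment:4: temporary files are cleaned up, and output from previously processed files is never repeated.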

If omindex is running as another user you could just do "killall -u xapian", for example, but a better way would be to use a regexp to find any processes running my macro.

But I'll have a look at using a persistent instance using the python bindings when I get a chance and avoid unoconv.

comment:6 by Olly Betts, 14 years ago

I've just added a new --filter option to omindex in trunk r15169, so it's now possible to implement this simply by running omindex something like so:

soffice -headless -accept='socket,host=127.0.0.1,port=2002;urp;' -nofirststartwizard
omindex --filter=application/msword:'unoconv --stdout -f text' [args...]

So I think it's probably best to just make this an example on the wiki or in the Omega documentation.
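For anyone driving omindex from a script, the --filter=MIMETYPE:COMMAND form shown above can be assembled programmatically. The following is a small Python sketch under stated assumptions: filter_args and omindex_argv are hypothetical helper names, and the trailing arguments are placeholders for the "[args...]" in the example, not real omindex options.

```python
from typing import List, Mapping, Sequence


def filter_args(filters: Mapping[str, str]) -> List[str]:
    """Build one --filter=MIMETYPE:COMMAND argument per mapping entry."""
    return [f"--filter={mime}:{command}" for mime, command in filters.items()]


def omindex_argv(filters: Mapping[str, str],
                 extra_args: Sequence[str]) -> List[str]:
    """Return an argv list suitable for subprocess.run (hypothetical helper).

    Because the list is passed without a shell, the filter command needs
    no extra quoting even though it contains spaces.
    """
    return ["omindex", *filter_args(filters), *extra_args]


if __name__ == "__main__":
    argv = omindex_argv(
        {"application/msword": "unoconv --stdout -f text"},
        ["[args...]"],  # placeholder, as in the ticket's example
    )
    print(argv)
```

The resulting list could be handed to subprocess.run once a listener such as the soffice command above is already running.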

comment:7 by Olly Betts, 14 years ago

Milestone: 1.2.x
Status: new → assigned

Someone on the mailing list noted a patch needed with unoconv 0.4 and an unresolved issue:

http://article.gmane.org/gmane.comp.search.xapian.general/8640

Documenting this as a more advanced --filter example is suitable for 1.2.x (at least once such issues are resolved) so setting milestone.

comment:8 by Olly Betts, 14 years ago

Milestone: 1.2.x → 1.2.6
Resolution: fixed
Status: assigned → closed

A cleaner way to launch the headless OpenOffice is unoconv --listener &.

I've added this as an example to docs/overview.rst in r15470.
