Opened 16 years ago
Closed 14 years ago
#324 closed enhancement (fixed)
A Script that uses OpenOffice to filter text for Xapian Omega
Reported by: | Frank J Bruzzaniti | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.6 |
Component: | Omega | Version: | |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | Linux |
Description (last modified by )
This Python script is an example of how to use OpenOffice to convert documents to text. It starts a headless instance of OpenOffice, which should remain running, and attempts to start a new instance if one is not already running. It also uses unoconv, which can be downloaded from http://dag.wieers.com/home-made/unoconv/.
Unoconv doesn't need to be told what format it is accepting, so you should be able to slot the script in anywhere in omindex without too much hassle. For example, I replaced antiword in omindex.cc with oOC.py (this script) because antiword couldn't open .doc files saved by WordPerfect.
I would love to get some high-end stability and performance testing of OpenOffice as a filter. I couldn't figure out how to get Python to correctly marshal the soffice process, so I parsed the output of the ps command. Maybe one of the Python gurus could have a look :)
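One way to avoid parsing ps output is to probe the listener's port directly. The following is a minimal sketch, not the attached script: the function name is hypothetical, and the defaults assume the listener was started on 127.0.0.1 port 2002. An open port only shows that something is listening there, not that it is OpenOffice.

```python
import socket


def listener_is_up(host="127.0.0.1", port=2002, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused,
        # timeout) when nothing is accepting connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A caller could start a new headless instance only when listener_is_up() returns False.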
Attachments (3)
Change History (11)
comment:1 by , 16 years ago
Component: | Examples → Omega |
---|---|
Description: | modified (diff) |
Sadly antiword isn't getting many updates now, so the option to use something more actively maintained would be useful. Perhaps openoffice is a bit heavyweight, but the ability to use a single instance in the background should at least mean the runtime overhead isn't an issue.
As you suggest, the "ps" stuff really needs replacing with something better (amongst other things, "ps -ef" isn't portable, and you're hard-coding the install location).
This script also assumes nothing is running on port 2002, and allows other users on the system to do things with your openoffice process, which is a potential security risk.
I had a quick look at unoconv and it looks like you'd do better to (but I've not tested either):
- use unoconv --listener to start a persistent openoffice process
- use unoconv --pipe to use a named pipe to communicate with openoffice instead of a TCP socket
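The two suggestions above could be combined roughly as follows. This is an untested sketch: the helper names and the pipe name omega_uno are hypothetical, the --stdout and -f text options are assumed to work as unoconv documents them, and --pipe is assumed to take the pipe name in the form --pipe=NAME.

```python
import subprocess


def listener_argv(pipe_name=None):
    """Command line for a persistent unoconv listener."""
    argv = ["unoconv", "--listener"]
    if pipe_name:
        # Assumption: --pipe=NAME selects a named pipe instead of a TCP socket.
        argv.append("--pipe=" + pipe_name)
    return argv


def convert_argv(path, pipe_name=None):
    """Command line that converts one document to plain text on stdout."""
    argv = ["unoconv", "--stdout", "-f", "text"]
    if pipe_name:
        argv.append("--pipe=" + pipe_name)
    argv.append(path)
    return argv


def start_listener(pipe_name=None):
    """Start the listener in the background and return the Popen handle."""
    return subprocess.Popen(listener_argv(pipe_name))
```

The idea would be to call start_listener("omega_uno") once, then run convert_argv(path, "omega_uno") for each document, so the slow OpenOffice startup is paid only once.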
comment:2 by , 16 years ago
Keywords: | open office convert added |
---|
Found a macro and created a bash script that seems to work ok as a proof of concept. It doesn't require openoffice to be running in the background like my previous script.
At this stage the macro converts pdf, ppt, pps, doc and xls files to HTML, since OpenOffice supports HTML export for almost everything, while only certain documents can be exported to text.
I guess we can use Xapian's HTML parsing in conjunction with this, although what's in the bash script would probably be better converted to C (which I'm not any good at).
I've commented the bash script so it should be easy enough to follow.
comment:3 by , 16 years ago
Keywords: | openoffice added; open office removed |
---|
Altered the bash script so it does cat *.html to show the output. Removed the deletion of the files created.
It's just a proof of concept at the moment. I'm hoping someone will be able to help verify that it's a good idea and help me integrate it into omindex.
Hopefully it will serve as a replacement for unmaintained packages like antiword.
comment:4 by , 16 years ago
Keywords: | openoffice convert removed |
---|---|
Summary: | A Script that users OpenOffice to filter text for Xapian Omega → A Script that uses OpenOffice to filter text for Xapian Omega |
I can see some people might prefer to use openoffice for such things. I don't think it's a suitable ubiquitous replacement for antiword - openoffice is just too heavyweight as a default. I think the best candidate for a replacement for antiword as the default is wvWare.
This script is problematic in a few ways though. E.g. just killing off any existing openoffice processes is very hostile - what if I'm writing a report and cron kicks off an index update on the same machine? Also, it doesn't appear to clean up temporary files, and since it does cat *.html that seems to mean it will produce the output from every previous file processed!
Also, openoffice is rather slow to start up, so I think we really would want to use a persistent instance.
comment:5 by , 16 years ago
I agree a persistent instance would be useful, but I've had issues with it being buggy. I've seen posts from a few people who have claimed the same, but I suspect the problem could be unoconv, since I know the Alfresco project uses OpenOffice in this way. Maybe it's worth looking into.
As for my script, it's not complete but it's a start. I guess I could dump the files in a "temporary" directory, do the cat *.html then delete the directory.
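The temporary directory idea might look like this in Python. It is a sketch only: collect_html and the convert callback are hypothetical names, and convert(source_path, out_dir) stands in for whatever runs the macro and writes its .html files into out_dir.

```python
import glob
import os
import tempfile


def collect_html(convert, source_path):
    """Convert into a fresh directory, return the combined HTML, then clean up."""
    # The directory and everything in it are removed when the block exits,
    # so output from previously processed files can never leak in.
    with tempfile.TemporaryDirectory(prefix="omega-ooo-") as out_dir:
        convert(source_path, out_dir)
        parts = []
        for name in sorted(glob.glob(os.path.join(out_dir, "*.html"))):
            with open(name, encoding="utf-8", errors="replace") as fh:
                parts.append(fh.read())
        return "".join(parts)
```

Because each call gets its own directory, the equivalent of cat *.html only ever sees the files produced for the current document.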
If omindex is running as another user you could just do "killall -u xapian" for example, but a better way would be to use a regexp to find any processes running my macro.
But I'll have a look at using a persistent instance using the python bindings when I get a chance and avoid unoconv.
comment:6 by , 14 years ago
I've just added a new --filter option to omindex in trunk r15169, so it's now possible to implement this simply by running omindex something like so:
soffice -headless -accept='socket,host=127.0.0.1,port=2002;urp;' -nofirststartwizard
omindex --filter=application/msword:'unoconv --stdout -f text' [args...]
So I think it's probably best to just make this an example on the wiki or in the Omega documentation.
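The example command above names no input file, so this sketch assumes omindex supplies the file name to the filter and reads the text from its standard output. Under that assumption, a small wrapper could guard against a hung or missing converter. The function name and the 60 second timeout are hypothetical, not part of Omega or unoconv.

```python
import subprocess


def run_filter(cmd, path, timeout=60):
    """Run cmd with path appended; return its stdout as text, or "" on failure."""
    try:
        done = subprocess.run(
            cmd + [path],
            capture_output=True,
            timeout=timeout,
            check=True,  # a non-zero exit status counts as a failure
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing executable, a timeout and a non-zero exit status.
        return ""
    return done.stdout.decode("utf-8", errors="replace")
```

For instance, run_filter(["unoconv", "--stdout", "-f", "text"], "report.doc") would return the extracted text, or an empty string if the conversion failed.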
comment:7 by , 14 years ago
Milestone: | → 1.2.x |
---|---|
Status: | new → assigned |
Someone on the mailing list noted a patch needed with unoconv 0.4 and an unresolved issue:
http://article.gmane.org/gmane.comp.search.xapian.general/8640
Documenting this as a more advanced --filter example is suitable for 1.2.x (at least once such issues are resolved) so setting milestone.
comment:8 by , 14 years ago
Milestone: | 1.2.x → 1.2.6 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
A cleaner way to launch the headless OpenOffice is "unoconv --listener &".
I've added this as an example to docs/overview.rst in r15470.
OpenOffice Filter Script