Opened 12 years ago

Closed 8 years ago

#383 closed enhancement (fixed)

Patch to replace antiword with abiword

Reported by: Frank J Bruzzaniti
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.17
Component: Omega
Version: 1.0.13
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

This patch replaces antiword with abiword. It was made against omega 1.0.13.

Attachments (7)

abiword.patch (457 bytes ) - added by Frank J Bruzzaniti 12 years ago.
Patch to replace Antiword with Abiword
Agenda 20090422 A.doc (102.0 KB ) - added by Frank J Bruzzaniti 12 years ago.
Example of a "buggy" .doc file. The title "Agenda" is not displayed when using antiword
antiword.txt (1.8 KB ) - added by Frank J Bruzzaniti 12 years ago.
antiword output from processing Agenda 20090422 A.doc
wvText.txt (3.3 KB ) - added by Frank J Bruzzaniti 12 years ago.
wvText output from processing Agenda 20090422 A.doc
abiword.txt (709 bytes ) - added by Frank J Bruzzaniti 12 years ago.
abiword output from processing Agenda 20090422 A.doc
NEWS4945.DOC (82.5 KB ) - added by Frank J Bruzzaniti 12 years ago.
Example of a .doc file saved by WordPerfect in 1995.
abiword_NEWS4945.txt (150.2 KB ) - added by Frank J Bruzzaniti 12 years ago.
Example of abiword output from processing a .doc file saved by WordPerfect in 1995.


Change History (17)

by Frank J Bruzzaniti, 12 years ago

Attachment: abiword.patch added

Patch to replace Antiword with Abiword

comment:1 by Frank J Bruzzaniti, 12 years ago

I didn't make a patch for wvWare, as they state on their website (http://wvware.sourceforge.net/) that wvWare should be considered deprecated in favor of AbiWord.

comment:2 by Olly Betts, 12 years ago

Um, we've already had essentially this same discussion on the mailing list (and you took part!)

http://thread.gmane.org/gmane.comp.search.xapian.general/7310/focus=7347

To summarise, the current status is:

  • it appears antiword is unmaintained (that's not actually a huge issue if it does the job, though it's certainly not a positive thing).
  • openoffice could replace it, but we don't currently have a clean solution, and openoffice is a rather heavyweight dependency, so it would be better to have a lightweight default with openoffice as an option.
  • abiword could also replace it, but it's also not terribly lightweight (probably not as bad as openoffice though).
  • we could easily replace antiword with wvWare, but wvWare is ~5 times slower, which is a bit of a hit to take just to be using a more actively maintained extractor - I feel there needs to be a more concrete benefit to be gained to justify this.
  • my (admittedly limited, as I don't have many .doc examples) testing showed equivalently good results from antiword and wvWare (and abiword to be essentially identical to wvWare, as you would expect).
  • you've claimed that antiword fails to correctly extract text from some documents, but didn't respond to my request for examples of such documents, so it's hard for me to judge how serious this is for myself. I haven't seen such reports from anyone else, but perhaps nobody else has looked at antiword's output in detail. It's hard for me to tell with the information I currently have...
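The speed comparison above (wvWare being roughly 5 times slower than antiword) can be rechecked locally with something like the following. This is only a minimal sketch: the helper names `time_command` and `compare_extractors` are made up for illustration, and the extractor command lines mentioned afterwards are assumptions that may need adjusting for the installed versions.

```python
import subprocess
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        best = min(best, time.perf_counter() - start)
    return best


def compare_extractors(commands, repeats=3):
    """Return {name: best time in seconds}; `commands` maps a name to an argv list."""
    return {name: time_command(argv, repeats) for name, argv in commands.items()}
```

For example, `compare_extractors({"antiword": ["antiword", "sample.doc"], "abiword": ["abiword", "--to=txt", "--to-name=fd://1", "sample.doc"]})` would time both tools on a hypothetical `sample.doc`, provided both are installed. Taking the best of several runs reduces noise from disk caching.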

comment:3 by Frank J Bruzzaniti, 12 years ago

Sorry, I didn't mean to stir the pot. I've included an example, with the output of all three programs, from a file that wasn't handled 100% correctly.

The issue was that the document's title was "Agenda". Someone naturally searched for it, but for some reason antiword misses the title.

I'll post other examples as I come across them again.

by Frank J Bruzzaniti, 12 years ago

Attachment: Agenda 20090422 A.doc added

Example of a "buggy" .doc file. The title "Agenda" is not displayed when using antiword

by Frank J Bruzzaniti, 12 years ago

Attachment: antiword.txt added

antiword output from processing Agenda 20090422 A.doc

by Frank J Bruzzaniti, 12 years ago

Attachment: wvText.txt added

wvText output from processing Agenda 20090422 A.doc

by Frank J Bruzzaniti, 12 years ago

Attachment: abiword.txt added

abiword output from processing Agenda 20090422 A.doc

comment:4 by Frank J Bruzzaniti, 12 years ago

BTW, I wasn't looking to replace antiword. I agree that if you need a fast default Word converter then antiword is probably the way to go. I found this patch useful because (apart from the example I submitted) I had noticed that a client of mine has a heap of documents from circa 1998 that were saved as .doc files by WordPerfect.

Unfortunately that was their now-defunct word processor and file format, so I needed an alternative. Since I found this patch useful, I thought I'd share it in case anyone else ran into similar weirdness.

I can't upload the documents for testing as they are client files but this is the error I get.

antiword: Word2: fast saved documents are not supported yet

wvText: Could not convert into HTML

abiword: I can see all the text in the body of the document, with a few extra little "artifacts". The start and end of the text file have perhaps 3-5 lines like: ¥-?!@?????-???????????€???v ??_???????????????????ö???????????????????????????????????????$?????$?$?????$?????$?????$?????$?????2??? ?R?????R?????R?????R?????R??? ?\?????R?????l???F?²?????²?????²?????²?????²?????²?????²?????²?????²?????´?????´?????´?????´?????´?????´?????ñ???4?%???:?Ò?????$???????????Ò????

Open Office: Opens perfectly
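The results above suggest one possible arrangement: keep the fast extractor as the default and fall back to a heavier one only when the default fails. Below is a minimal sketch of that idea, not what omindex does. The function `extract_text` and the list `DEFAULT_EXTRACTORS` are hypothetical names, the command lines are assumptions, and it assumes an extractor signals failure through a non-zero exit status or empty output.

```python
import subprocess


def extract_text(path, extractors):
    """Try each extractor in order; return (name, text) from the first that succeeds.

    `extractors` is a list of (name, argv_prefix) pairs; the file path is
    appended to argv_prefix. An extractor counts as failed if it is not
    installed, exits non-zero, or produces no output.
    """
    errors = []
    for name, argv_prefix in extractors:
        try:
            result = subprocess.run(
                argv_prefix + [path],
                capture_output=True,
                check=False,
            )
        except FileNotFoundError:
            errors.append(f"{name}: not installed")
            continue
        text = result.stdout.decode("utf-8", errors="replace")
        if result.returncode == 0 and text.strip():
            return name, text
        stderr = result.stderr.decode("utf-8", errors="replace").strip()
        errors.append(f"{name}: {stderr or 'no output'}")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Assumed invocations: fast default first, heavier fallback second.
DEFAULT_EXTRACTORS = [
    ("antiword", ["antiword"]),
    ("abiword", ["abiword", "--to=txt", "--to-name=fd://1"]),
]
```

Calling `extract_text("NEWS4945.DOC", DEFAULT_EXTRACTORS)` would then only pay abiword's startup cost for files the first tool cannot handle. The collected error messages (such as the "fast saved documents are not supported yet" text) end up in the exception if every extractor fails.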

comment:5 by Frank J Bruzzaniti, 12 years ago

I've found a "safe" example of a document that I think was saved as .doc by WordPerfect.

NEWS4945.DOC

I've included the output

NEWS4945.txt

I've attached both to this ticket.

In my case, using antiword would have meant years of documents being missed. Hence the patch, for those who need it.

by Frank J Bruzzaniti, 12 years ago

Attachment: NEWS4945.DOC added

Example of a .doc file saved by WordPerfect in 1995.

by Frank J Bruzzaniti, 12 years ago

Attachment: abiword_NEWS4945.txt added

Example of abiword output from processing a .doc file saved by WordPerfect in 1995.

comment:6 by Olly Betts, 11 years ago

Further points against antiword:

  • On small documents, antiword gives up with "I'm afraid the text stream of this file is too small to handle." but wvText works.
  • wvText extracts text in headers and footers, but antiword doesn't.

comment:7 by Olly Betts, 11 years ago

Component: Other → Omega
Milestone: 1.2.x
Status: new → assigned

Looks like antiword doesn't work with "fast saved" documents (which essentially just append a delta to the previously saved version of the file; this was typically much faster in the days of floppies, but isn't now). Fast saving seems to no longer be used by default, and it is more complex to implement, so it's not as widely supported.

We can make this change pretty seamlessly, so marking as suitable for 1.2.x.

Not sure if we should just replace antiword completely, or allow which filter to use to be configured.

comment:8 by Olly Betts, 9 years ago

Milestone: 1.2.x → 1.3.x
Version: 1.0.13

1.3.x material now.

comment:9 by Olly Betts, 8 years ago

Milestone: 1.3.x → 1.2.17

Since this ticket was opened, omindex has gained a --filter command line option which provides a way for the user to specify their own filter, so I've added the abiword command line from the patch as a nice example of how to use --filter on trunk in r17754. Marking to backport for 1.2.17.
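As a rough illustration of the --filter approach, the sketch below assembles an omindex command line that routes Word documents through abiword. The helper `omindex_command`, the database path, and the document directory are hypothetical. The abiword command line is an assumption and may differ from the one in the attached patch. The `MIMETYPE:COMMAND` form of the --filter argument should be checked against the omindex documentation for the installed version.

```python
import shlex


def omindex_command(db_path, base_url, directory, mime_type, filter_cmd):
    """Build an omindex argv list that routes `mime_type` through `filter_cmd`."""
    return [
        "omindex",
        "--db", db_path,
        "--url", base_url,
        f"--filter={mime_type}:{filter_cmd}",
        directory,
    ]


# Hypothetical paths; the abiword command line is an assumption.
argv = omindex_command(
    db_path="/var/lib/omega/data/default",
    base_url="/",
    directory="/srv/docs",
    mime_type="application/msword",
    filter_cmd="abiword --to=txt --to-name=fd://1",
)

# Print a shell-quoted form for inspection; nothing is executed here.
print(shlex.join(argv))
```

Building the argument list rather than a single string means the filter command, which contains spaces, stays one argument when passed to `subprocess.run(argv)`.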

comment:10 by Olly Betts, 8 years ago

Resolution: fixed
Status: assigned → closed

Backported for 1.2.17 in r17756.
