Opened 16 years ago
Closed 11 years ago
#383 closed enhancement (fixed)
Patch to replace antiword with abiword
Reported by: | Frank J Bruzzaniti | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.17 |
Component: | Omega | Version: | 1.0.13 |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
This patch replaces antiword with abiword. Patched against omega 1.0.13
Attachments (7)
Change History (17)
by , 16 years ago
Attachment: | abiword.patch added |
---|
comment:1 by , 16 years ago
I didn't make a patch for wvWare, as they claim on their website (http://wvware.sourceforge.net/) that wvWare should be considered deprecated in favor of AbiWord.
comment:2 by , 16 years ago
Um, we've already had essentially this same discussion on the mailing list (and you took part!)
http://thread.gmane.org/gmane.comp.search.xapian.general/7310/focus=7347
To summarise, the current status is:
- it appears antiword is unmaintained (that's not actually a huge issue if it does the job, though it's certainly not a positive thing).
- openoffice could replace it, but we don't currently have a clean solution, and openoffice is a rather heavyweight dependency, so it would be better to have a lightweight default with openoffice as an option.
- abiword could also replace it, but it's also not terribly lightweight (probably not as bad as openoffice though).
- we could easily replace antiword with wvWare, but wvWare is ~5 times slower, which is a bit of a hit to take just to be using a more actively maintained extractor - I feel there needs to be a more concrete benefit to be gained to justify this.
- my (admittedly limited, as I don't have many .doc examples) testing showed equivalently good results from antiword and wvWare (and abiword to be essentially identical to wvWare, as you would expect).
- you've claimed that antiword fails to correctly extract text from some documents, but didn't respond to my request for examples of such documents, so it's hard for me to judge how serious this is for myself. I haven't seen such reports from anyone else, but perhaps nobody else has looked at antiword's output in detail. It's hard for me to tell with the information I currently have...
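The speed comparison in the list above (wvWare being roughly 5 times slower than antiword) could be reproduced with a small timing harness along these lines. This is only a sketch: the helper names `best_run_time` and `compare` are hypothetical, and the three command lines are assumptions about how each tool is usually invoked rather than the commands actually used for the figures quoted here.

```python
import subprocess
import time


def best_run_time(argv, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` runs.

    Returns None if the program in argv[0] is not installed.
    """
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            # The extracted text is discarded; only elapsed time matters here.
            subprocess.run(argv, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, check=False)
        except FileNotFoundError:
            return None
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best


def compare(doc_path, out_path="wvtext-out.txt"):
    """Time each candidate extractor on one .doc file."""
    # Assumed invocations: adjust to match the installed versions.
    candidates = {
        "antiword": ["antiword", doc_path],
        "wvText": ["wvText", doc_path, out_path],
        "abiword": ["abiword", "--to=txt", "--to-name=fd://1", doc_path],
    }
    return {name: best_run_time(argv) for name, argv in candidates.items()}
```

Calling `compare("sample.doc")` returns a dict mapping each tool name to its best time, with `None` for tools that are not installed. The measurement includes process start-up, so a larger set of documents would give a fairer picture than a single small file.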
comment:3 by , 16 years ago
Sorry I didn't mean to stir the pot, I've included an example with the output of all three programs from a file that didn't work 100% correctly.
The issue is that the document's title is "Agenda". Someone naturally searched for it, but for some reason antiword misses the title.
I'll post other examples as I come across them again.
by , 16 years ago
Attachment: | Agenda 20090422 A.doc added |
---|
Example of "buggy" doc file. The title "Agenda" is not displayed using antiword.
by , 16 years ago
Attachment: | antiword.txt added |
---|
antiword output from processing Agenda 20090422 A.doc
by , 16 years ago
Attachment: | abiword.txt added |
---|
abiword output from processing Agenda 20090422 A.doc
comment:4 by , 16 years ago
BTW I wasn't looking to replace antiword. I agree that if you need a fast default Word converter then antiword is probably the way to go. I found this patch useful because (apart from the example I submitted) I had noticed that a client of mine has a heap of Word documents from circa 1998 that were saved as .doc files by WordPerfect.
Unfortunately that was their now-defunct word processor and file format, so I needed an alternative. Since I found this patch useful, I just thought I'd share it in case anyone else had any weirdness.
I can't upload the documents for testing as they are client files but this is the error I get.
antiword: Word2: fast saved documents are not supported yet
wvText: Could not convert into HTML
abiword: I can see all the text in the body of the document, with a few extra little "artifacts". The start and end of the text file have perhaps 3-5 lines like: ¥-?!@?????-??????????????v ??_???????????????????ö???????????????????????????????????????$?????$?$?????$?????$?????$?????$?????2??? ?R?????R?????R?????R?????R??? ?\?????R?????l???F?²?????²?????²?????²?????²?????²?????²?????²?????²?????´?????´?????´?????´?????´?????´?????ñ???4?%???:?Ò?????$???????????Ò????
Open Office: Opens perfectly
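The manual process described above (try antiword, then fall back to another converter when it refuses the file) could be automated roughly as follows. This is a sketch under stated assumptions, not what the attached patch does: `extract_text` and `EXTRACTORS` are hypothetical names, and the antiword and abiword command lines are assumed invocations.

```python
import subprocess


def extract_text(path, extractors, timeout=60):
    """Try each extractor in order and return (name, text) for the first
    one that exits successfully with non-empty output.

    Returns (None, "") if every extractor fails.
    """
    for name, make_argv in extractors:
        try:
            result = subprocess.run(make_argv(path), capture_output=True,
                                    timeout=timeout)
        except (FileNotFoundError, subprocess.TimeoutExpired):
            continue  # tool not installed, or it hung: try the next one
        if result.returncode == 0 and result.stdout.strip():
            return name, result.stdout.decode("utf-8", errors="replace")
    return None, ""


# Assumed invocations: fast converter first, heavier fallback second.
EXTRACTORS = [
    ("antiword", lambda p: ["antiword", p]),
    ("abiword", lambda p: ["abiword", "--to=txt", "--to-name=fd://1", p]),
]
```

With this ordering, documents that antiword handles keep its speed, and only files it rejects (such as the fast-saved ones above) pay the cost of the heavier converter. It assumes a failing converter exits with a non-zero status or prints nothing, which should be checked for each tool.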
comment:5 by , 16 years ago
I've found a "safe" example of a document that I think was saved as .doc by WordPerfect.
NEWS4945.DOC
I've included the output
NEWS4945.txt
I've attached both to this ticket.
In my case, relying on antiword would have meant years of documents being missed. Hence the patch, for those who need it.
by , 16 years ago
Attachment: | NEWS4945.DOC added |
---|
Example of a WordPerfect file from 1995 saved in .doc format.
by , 16 years ago
Attachment: | abiword_NEWS4945.txt added |
---|
Example of abiword output from processing a WordPerfect file from 1995 saved in .doc format.
comment:6 by , 14 years ago
Further points against antiword:
- On small documents, antiword gives up with "I'm afraid the text stream of this file is too small to handle." but wvText works.
- wvText extracts text in headers and footers, but antiword doesn't.
comment:7 by , 14 years ago
Component: | Other → Omega |
---|---|
Milestone: | → 1.2.x |
Status: | new → assigned |
Looks like antiword doesn't work with "fast saved" documents (which essentially just append a delta to the previously saved version of the file; this was typically much faster in the days of floppies, but isn't now). Fast save seems to no longer be used by default, and is more complex to implement, so it's not as widely supported.
We can make this change pretty seamlessly, so marking as suitable for 1.2.x.
Not sure if we should just replace antiword completely, or make the choice of filter configurable.
comment:9 by , 11 years ago
Milestone: | 1.3.x → 1.2.17 |
---|
Since this ticket was opened, omindex has gained a --filter command line option which provides a way for the user to specify their own filter, so I've added the abiword command line from the patch as a nice example of how to use --filter on trunk in r17754. Marking to backport for 1.2.17.
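For readers who want to try this, an omindex invocation using `--filter` might be assembled as below. This is a sketch only: the ticket does not quote the exact abiword command added in r17754, so the `abiword --to=txt --to-name=fd://1` command, the `MIMETYPE:COMMAND` form of the option value, and the database path and URL are all assumptions, and `omindex_argv` is a hypothetical helper.

```python
import shlex


def omindex_argv(db, url, directory, mime_type, filter_cmd):
    """Build an omindex argument list that indexes `directory` and runs
    `filter_cmd` on every file of the given MIME type."""
    return [
        "omindex",
        "--db", db,
        "--url", url,
        # Assumed form of the option value: MIMETYPE:COMMAND
        f"--filter={mime_type}:{filter_cmd}",
        directory,
    ]


argv = omindex_argv(
    db="/var/lib/omega/data/default",   # placeholder database path
    url="/",                            # placeholder base URL
    directory="/srv/docs",              # placeholder directory to index
    mime_type="application/msword",
    filter_cmd="abiword --to=txt --to-name=fd://1",  # assumed command
)

# Print a copy-and-paste form with shell quoting applied.
print(shlex.join(argv))
```

Building the list first and quoting it with `shlex.join` avoids hand-quoting the filter command, which contains spaces. The list can also be passed directly to `subprocess.run` without going through a shell.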
comment:10 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Backported for 1.2.17 in r17756.