Opened 16 years ago
Closed 16 years ago
#290 closed enhancement (fixed)
Omega support for Office 2007 Documents
Reported by: | Frank J Bruzzaniti | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.1.2 |
Component: | Omega | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description (last modified by )
Support for Word, Excel, and PowerPoint committed to trunk as r11866; XPS as r11904; macro-enabled versions in r12892; extraction of pptx notes and comments (if present) in r12893.
Original description:
This patch uses the xmlparser and unzip to extract and process strings from *.xlsx and *.docx files.
P.S. First time I have used svn to create a diff or Trac so forgive me if I've screwed something up :)
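The approach in the original description (unzip the file, then parse the XML inside it) can be sketched roughly as follows. This is an illustrative Python sketch using only the standard library, not the code from the patch: Omega itself is C++, and the function name `extract_docx_text` is hypothetical. It assumes the main document body lives in `word/document.xml`, which is the usual location in a .docx archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(source):
    """Return the text of a .docx file, one line per paragraph.

    `source` is a filename or a binary file-like object.
    (Hypothetical helper for illustration; not part of Omega.)
    """
    with zipfile.ZipFile(source) as archive:
        # Assumption: the document body is stored in word/document.xml.
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Text is split across runs; each run keeps its text in <w:t>.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `extract_docx_text("example.docx")` would return the paragraph text, which could then be handed to an indexer.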
Attachments (7)
Change History (28)
by , 16 years ago
Attachment: | omindex.diff added |
---|
comment:1 by , 16 years ago
Keywords: | Office 2007 Excel Word removed |
---|---|
Status: | new → assigned |
follow-up: 5 comment:2 by , 16 years ago
Milestone: | 1.0.8 → 1.1.0 |
---|
I've taken a closer look at the patch. It looks good apart from the lack of documentation updates and example files. I'm afraid I don't currently have the time to update the documentation or track down suitable examples myself, so I'm moving the milestone to 1.1.0.
While checking that the content-types used were appropriate (oddly they aren't listed by IANA, but they are mentioned in posts on blogs.msdn.com, so I guess they're OK), I found there are some other formats from which we can probably extract text in the same way.
If you can comment on any of the following, that would help. Otherwise I'll research as time allows.
http://blogs.msdn.com/dmahugh/archive/2006/08/08/692600.aspx lists more extensions and content types:
- Are the weirdly-named "macroEnabled.12" variants compatible formats?
- We handle .dot files, so should handle .dotx too if possible.
- We should handle .ppsx and .pptx if the same approach works (and .ppsm and .pptm if they have the same format).
- Ditto .xps.
http://blogs.msdn.com/ericwhite/pages/the-openxmldocument-class.aspx also mentions "drawings".
comment:3 by , 16 years ago
http://www.lesbonscomptes.com/recoll/filters/rclopxml seems to show the paths to look for in a slideshow or presentation.
comment:4 by , 16 years ago
Milestone: | 1.1.0 → 1.1.1 |
---|
Bumping milestone to 1.1.1 as this is ready to apply and isn't an incompatible change.
by , 16 years ago
Attachment: | ms2007.patch added |
---|
Here's a new patch that includes support for PowerPoint (.pptx).
by , 16 years ago
Attachment: | overview.rst added |
---|
I've updated the docs; I'm not sure if it was meant to be a diff. If so, let me know.
comment:5 by , 16 years ago
Replying to olly:
> I've taken a closer look at the patch. It looks good apart from the lack of documentation updates and example files. I'm afraid I don't currently have the time to update the documentation or track down suitable examples myself, so I'm moving the milestone to 1.1.0.
> While checking that the content-types used were appropriate (oddly they aren't listed by IANA, but they are mentioned in posts on blogs.msdn.com, so I guess they're OK), I found there are some other formats from which we can probably extract text in the same way.
> If you can comment on any of the following, that would help. Otherwise I'll research as time allows.
> http://blogs.msdn.com/dmahugh/archive/2006/08/08/692600.aspx lists more extensions and content types:
> - Are the weirdly-named "macroEnabled.12" variants compatible formats?
> - We handle .dot files, so should handle .dotx too if possible.
> - We should handle .ppsx and .pptx if the same approach works (and .ppsm and .pptm if they have the same format).
> - Ditto .xps.
> http://blogs.msdn.com/ericwhite/pages/the-openxmldocument-class.aspx also mentions "drawings".
XPS format is do-able, it's very similar.
Office 2007 mimetypes are here: http://blogs.msdn.com/vsofficedeveloper/pages/Office-2007-Open-XML-MIME-Types.aspx
Can't find any mention of a "macroEnabled.12" variant for any of the "openxmlformats" like docx: I'll see if I can filter this lot: .docx .dotx .xlsx .xltx .pptx .potx .ppsx
BTW, for test documents, would it work if I use some text in Japanese and English?
comment:6 by , 16 years ago
I've created a patch which contains support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx. It also has msg and last_mod patched into it. Do I need to separate those into another patch?
I've also uploaded a patch file for overview.rst and I've uploaded some 2007 test docs.
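For reference, the extension list above maps onto the Office 2007 Open XML MIME types roughly as in the sketch below. This is a Python illustration only; the table and the helper name `guess_office_2007_mime_type` are hypothetical and are not how Omega (which is C++) stores its mapping.

```python
import os

# Common prefix shared by the non-macro Office Open XML content types.
OOXML_PREFIX = "application/vnd.openxmlformats-officedocument."

# Illustrative table (hypothetical name); covers only the extensions
# listed in the patch description.
OFFICE_2007_MIME_TYPES = {
    "docx": OOXML_PREFIX + "wordprocessingml.document",
    "dotx": OOXML_PREFIX + "wordprocessingml.template",
    "xlsx": OOXML_PREFIX + "spreadsheetml.sheet",
    "xltx": OOXML_PREFIX + "spreadsheetml.template",
    "pptx": OOXML_PREFIX + "presentationml.presentation",
    "potx": OOXML_PREFIX + "presentationml.template",
    "ppsx": OOXML_PREFIX + "presentationml.slideshow",
}


def guess_office_2007_mime_type(filename):
    """Return the MIME type for a known extension, or None otherwise."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return OFFICE_2007_MIME_TYPES.get(ext)
```

The lookup is case-insensitive on the extension, so `Report.DOCX` and `report.docx` resolve to the same type.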
by , 16 years ago
Attachment: | office2007.patch added |
---|
Patch adds support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx
by , 16 years ago
Attachment: | overview.rst.patch added |
---|
Updated doco for .docx .dotx .xlsx .xltx .pptx .potx .ppsx
comment:7 by , 16 years ago
It seems that "docm", etc. are Office 2007 formats:
http://filext.com/file-extension/DOCM
And googling for filetype:docm finds a number of files - I checked one and it is the xml format:
http://www.dangerofchi.org/ninja.docm
So it looks to me like we should add these to the list of understood extensions. Any reason not to?
A diff for the docs is better - thanks for that.
I'll look at applying what's here now.
comment:8 by , 16 years ago
I found a few .docm files on the net and tested them. They are put together the same way as .docx files, so we could use the same filter by adding another mimetype. But do you think we need to index the macros as well?
comment:9 by , 16 years ago
I don't think indexing the macros themselves is useful, but my understanding is that these aren't just files full of macros, but macro-enabled documents - i.e. documents with macros which the file extension says it is OK to run (quite a scary concept that the extension or mime-type should be trusted to say that, but that seems to be what these are).
But perhaps I've misunderstood - I didn't find any actual MS documentation of these aside from the lists of mime type mappings. I haven't added support for them for now, but it is easy to do so.
I've merged the other Office 2007 changes (and refactored a little to reduce the code repetition). You'd missed updating the list of mime-types in the docs, but I've addressed that.
And the sample files seem to index correctly, so it all looks good.
So I'm ready to commit, but what copyright notice should I add for your changes? (I don't know if the copyright is owned by you, or an employer or similar).
comment:11 by , 16 years ago
Description: | modified (diff) |
---|---|
Summary: | Omega support for Office 2007 Word and Excel Documents → Omega support for Office 2007 Documents |
OK, applied, and updated the description to list the remaining issues.
comment:12 by , 16 years ago
Description: | modified (diff) |
---|
I've written a quick XpsXmlParser class and wired it up to handle xps files. Seems to work for the example attached here. Committed to trunk as r11904.
comment:13 by , 16 years ago
Description: | modified (diff) |
---|
follow-up: 17 comment:15 by , 16 years ago
Hi,
Just wondering: aren't xlsx sheets a collection of xml files? I tried to index this kind of file with swish-e some months ago and found that, instead of one file, there was one file per sheet in the "zip" archive. Wouldn't it be useful to index them all?
comment:16 by , 16 years ago
I don't really know, but currently we index from xl/sharedStrings.xml inside the zip archive, which seems to contain the text in the example xlsx file I have. There's no text in xl/worksheets/sheet1.xml in this file.
If you have example files where there's useful text in the files in worksheets, please let us have them. Might be less confusing to open a new ticket and attach them to it.
comment:17 by , 16 years ago
Replying to cedric.jeanneret:
> Hi,
> Just wondering: aren't xlsx sheets a collection of xml files? I tried to index this kind of file with swish-e some months ago and found that, instead of one file, there was one file per sheet in the "zip" archive. Wouldn't it be useful to index them all?
In my testing the "strings" are all in xl/sharedStrings.xml; the other xml files contain numbers/formulas, which I didn't think you would want to index.
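To illustrate the point about xl/sharedStrings.xml, here is a minimal Python sketch that pulls the shared strings out of an .xlsx archive. It is not Omega's implementation; the function name `extract_xlsx_strings` is hypothetical, and it assumes the workbook stores its cell text in the shared string table, as in the example files discussed here.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by xl/sharedStrings.xml.
S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def extract_xlsx_strings(source):
    """Return the shared strings of an .xlsx file as a list.

    `source` is a filename or a binary file-like object.
    (Hypothetical helper for illustration; not part of Omega.)
    """
    with zipfile.ZipFile(source) as archive:
        try:
            data = archive.read("xl/sharedStrings.xml")
        except KeyError:
            # No shared string table in this archive.
            return []

    root = ET.fromstring(data)
    strings = []
    for item in root.iter(f"{{{S_NS}}}si"):
        # Joins every <t> under the item, so rich-text runs are merged
        # (phonetic runs, if present, are included too).
        parts = [t.text or "" for t in item.iter(f"{{{S_NS}}}t")]
        strings.append("".join(parts))
    return strings
```

Numbers and formulas stay in the per-sheet files under xl/worksheets/ and are deliberately ignored by this sketch.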
comment:19 by , 16 years ago
Milestone: | 1.1.1 → 1.1.2 |
---|
comment:20 by , 16 years ago
Description: | modified (diff) |
---|
Added support for macro-enabled versions in r12892 (I found an example of each on the web and checked the formats were the same).
comment:21 by , 16 years ago
Description: | modified (diff) |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
Added code to extract pptx notesSlides and comments, if present. That means this ticket can be closed.
While the last two changes could be backported to 1.0.x, I think they are less common and so less important cases, and so I'm not planning to - instead I'd prefer to focus on getting to 1.2.
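As a rough illustration of the pptx notesSlides and comments extraction mentioned in this comment, the Python sketch below picks out the archive members that would carry text. It only selects member names and does not parse them. The function name `pptx_text_members` is hypothetical, and the paths assume the conventional package layout (ppt/slides/, ppt/notesSlides/, ppt/comments/); this is not the code committed to Omega.

```python
import re

# Matches slide, notes-slide, and comment parts under the conventional
# pptx layout (an assumption; not taken from the Omega source).
_PPTX_TEXT_PART = re.compile(
    r"^ppt/(slides/slide|notesSlides/notesSlide|comments/comment)(\d+)\.xml$"
)


def pptx_text_members(names):
    """Return the member names likely to contain text.

    Results are grouped by kind (alphabetically) and then ordered by
    their numeric suffix, so slide2 sorts before slide10.
    """
    matches = []
    for name in names:
        m = _PPTX_TEXT_PART.match(name)
        if m:
            matches.append((m.group(1), int(m.group(2)), name))
    matches.sort()
    return [name for _, _, name in matches]
```

In practice the input would come from `zipfile.ZipFile(path).namelist()`; notes and comments parts are simply absent from the result when the presentation has none.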
Thanks. Could you update the documentation to match? And ideally provide some sample files which are redistributable and contain some non-ASCII characters? For more information, see:
http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat
(Also, as the bug report form suggests, please don't set "keywords" when reporting a bug - its purpose isn't what everyone naturally seems to think it is! One of these days I'll work out how to get trac to hide it...)