Opened 12 years ago

Closed 11 years ago

#290 closed enhancement (fixed)

Omega support for Office 2007 Documents

Reported by: Frank J Bruzzaniti
Owned by: Olly Betts
Priority: normal
Milestone: 1.1.2
Component: Omega
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Support for Word, Excel, and PowerPoint committed to trunk as r11866; XPS as r11904; macro-enabled versions in r12892; extraction of PowerPoint notes and comments (if present) in r12893.


Original description:

This patch uses the xmlparser and unzip to extract and process strings from *.xlsx and *.docx files.

P.S. First time I have used svn to create a diff or Trac so forgive me if I've screwed something up :)
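
For readers unfamiliar with the format, the approach in the original description (unzip the archive, then parse the XML inside it) can be sketched in a few lines. This is only an illustration in Python, not the patch itself, which is C++ code in omindex. The function names `local_name` and `extract_docx_text` are made up for this sketch, and it assumes the body text lives in `word/document.xml`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_docx_text(data):
    """Return the text runs from word/document.xml in a .docx archive.

    `data` is the raw bytes of the .docx file (hypothetical helper, not
    part of Omega).
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # Text runs are stored in <w:t> elements; match on the local name only
    # so the sketch does not depend on the namespace prefix used.
    runs = [el.text for el in root.iter()
            if local_name(el.tag) == "t" and el.text]
    return " ".join(runs)
```

Calling `extract_docx_text(open(path, "rb").read())` would give a single string that could be handed to an indexer. Joining runs with a space is a simplification; it can split a word that Word stored as two runs.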

Attachments (7)

omindex.diff (2.3 KB ) - added by Frank J Bruzzaniti 12 years ago.
ms2007.patch (3.0 KB ) - added by Frank J Bruzzaniti 12 years ago.
Here's a new patch that includes support for PowerPoint (.pptx)
overview.rst (18.8 KB ) - added by Frank J Bruzzaniti 12 years ago.
I've updated the doco, not sure if it was meant to be a diff. If so let me know
office2007.patch (6.5 KB ) - added by Frank J Bruzzaniti 12 years ago.
Patch adds support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx
overview.rst.patch (994 bytes ) - added by Frank J Bruzzaniti 12 years ago.
Updated doco for .docx .dotx .xlsx .xltx .pptx .potx .ppsx
2007_Test_Docs.tar.gz (186.7 KB ) - added by Frank J Bruzzaniti 12 years ago.
Test files to test filter
OpenSearchをサポートする検索エンジンとソフトウェア.xps (138.9 KB ) - added by Frank J Bruzzaniti 12 years ago.
XPS test file

Change History (28)

by Frank J Bruzzaniti, 12 years ago

Attachment: omindex.diff added

comment:1 by Olly Betts, 12 years ago

Keywords: Office 2007 Excel Word removed
Status: new → assigned

Thanks. Could you update the documentation to match? And ideally provide some sample files which are redistributable and contain some non-ASCII characters? For more information, see:

http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat

(Also, as the bug report form suggests, please don't set "keywords" when reporting a bug - its purpose isn't what everyone naturally seems to think it is! One of these days I'll work out how to get trac to hide it...)

comment:2 by Olly Betts, 12 years ago

Milestone: 1.0.8 → 1.1.0

I've taken a closer look at the patch. It looks good apart from the lack of documentation updates and example files. I'm afraid I don't currently have the time to update the documentation or track down suitable examples myself right now, so I'm moving the milestone to 1.1.0.

While checking that the content-types used were appropriate (oddly they aren't listed by IANA, but they are mentioned in posts on blogs.msdn.com so I guess they're OK), I found there are some other formats from which we can probably extract text in the same way.

If you can comment on any of the following, that would help. Otherwise I'll research as time allows.

http://blogs.msdn.com/dmahugh/archive/2006/08/08/692600.aspx lists more extensions and content types:

  • Are the weirdly-named "macroEnabled.12" variants compatible formats?
  • We handle .dot files, so should handle .dotx too if possible.
  • We should handle .ppsx and .pptx if the same approach works (and .ppsm and .pptm if they have the same format).
  • Ditto .xps.

http://blogs.msdn.com/ericwhite/pages/the-openxmldocument-class.aspx also mentions "drawings".

comment:3 by Olly Betts, 12 years ago

http://www.lesbonscomptes.com/recoll/filters/rclopxml seems to show the paths to look for in a slideshow or presentation.

comment:4 by Olly Betts, 12 years ago

Milestone: 1.1.0 → 1.1.1

Bumping milestone to 1.1.1 as this is ready to apply and isn't an incompatible change.

by Frank J Bruzzaniti, 12 years ago

Attachment: ms2007.patch added

Here's a new patch that includes support for PowerPoint (.pptx)

by Frank J Bruzzaniti, 12 years ago

Attachment: overview.rst added

I've updated the doco, not sure if it was meant to be a diff. If so let me know

in reply to:  2 comment:5 by Frank J Bruzzaniti, 12 years ago

Replying to olly:

The XPS format is doable; it's very similar.

Office 2007 mimetypes are here: http://blogs.msdn.com/vsofficedeveloper/pages/Office-2007-Open-XML-MIME-Types.aspx

I can't find any mention of a "macroEnabled.12" variant for any of the "openxmlformats" types such as docx. I'll see if I can filter this lot: .docx .dotx .xlsx .xltx .pptx .potx .ppsx

BTW, for test documents, would some text in Japanese and English work?

comment:6 by Frank J Bruzzaniti, 12 years ago

I've created a patch which contains support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx. It also has msg and last_mod patched into it. Do I need to separate those into another patch?

I've also uploaded a patch file for overview.rst and I've uploaded some 2007 test docs.

by Frank J Bruzzaniti, 12 years ago

Attachment: office2007.patch added

Patch adds support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx

by Frank J Bruzzaniti, 12 years ago

Attachment: overview.rst.patch added

Updated doco for .docx .dotx .xlsx .xltx .pptx .potx .ppsx

by Frank J Bruzzaniti, 12 years ago

Attachment: 2007_Test_Docs.tar.gz added

Test files to test filter

comment:7 by Olly Betts, 12 years ago

It seems that "docm", etc. are Office 2007 formats:

http://filext.com/file-extension/DOCM

And googling for filetype:docm finds a number of files - I checked one and it is the xml format:

http://www.dangerofchi.org/ninja.docm

So it looks to me like we should add these to the list of understood extensions. Any reason not to?

A diff for the docs is better - thanks for that.

I'll look at applying what's here now.

comment:8 by Frank J Bruzzaniti, 12 years ago

I found a few docm files on the net and tested them. They are put together the same way as docx files, so we could use the same filter by adding another mimetype. But do you think we need to index the macros as well?

comment:9 by Olly Betts, 12 years ago

I don't think indexing the macros themselves is useful, but my understanding is that these aren't just files full of macros, but macro-enabled documents - i.e. documents with macros which the file extension says it is OK to run (quite a scary concept that the extension or mime-type should be trusted to say that, but that seems to be what these are).

But perhaps I've misunderstood - I didn't find any actual MS documentation of these aside from the lists of mime type mappings. I haven't added support for them for now, but it is easy to do so.

I've merged the other Office 2007 changes (and refactored a little to reduce the code repetition). You'd missed updating the list of mime-types in the docs, but I've addressed that.

And the sample files seem to index correctly, so it all looks good.

So I'm ready to commit, but what copyright notice should I add for your changes? (I don't know if the copyright is owned by you, or an employer or similar).

comment:10 by Frank J Bruzzaniti, 12 years ago

I'm unemployed atm so you can copyright Frank J Bruzzaniti

comment:11 by Olly Betts, 12 years ago

Description: modified (diff)
Summary: Omega support for Office 2007 Word and Excel Documents → Omega support for Office 2007 Documents

OK, applied, and updated the description to list the remaining issues.

comment:12 by Olly Betts, 12 years ago

Description: modified (diff)

I've written a quick XpsXmlParser class and wired it up to handle xps files. Seems to work for the example attached here. Committed to trunk as r11904.
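
As a rough illustration of what such a parser has to do (this is a Python sketch, not the XpsXmlParser code committed in r11904): an XPS file is a zip archive whose page parts end in `.fpage`, and the visible text is carried in the `UnicodeString` attribute of `Glyphs` elements. The function name `extract_xps_text` is hypothetical, and the sketch assumes every `.fpage` entry is a page worth indexing.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_xps_text(data):
    """Return the UnicodeString values of all Glyphs elements, page by page.

    `data` is the raw bytes of the .xps file (hypothetical helper).
    """
    strings = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Plain lexical sort: good enough for a sketch, but it puts
        # 10.fpage before 2.fpage.
        pages = sorted(name for name in archive.namelist()
                       if name.lower().endswith(".fpage"))
        for name in pages:
            root = ET.fromstring(archive.read(name))
            for el in root.iter():
                # Ignore the namespace and match the local name only.
                if el.tag.rsplit("}", 1)[-1] == "Glyphs":
                    text = el.get("UnicodeString")
                    if text:
                        strings.append(text)
    return strings
```

Each `Glyphs` element typically holds one positioned run of text, so the result is a list of fragments rather than paragraphs. That is usually fine for indexing, where word order within a page matters less than which words are present.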

comment:13 by Olly Betts, 12 years ago

Description: modified (diff)

comment:14 by Olly Betts, 12 years ago

Backported Office 2007 (r12096) and XPS (r12097) support for 1.0.11.

comment:15 by Cédric Jeanneret, 12 years ago

Hi,

just wondering: aren't xlsx sheets a set of XML files? I tried to index this kind of file with swish-e some months ago and found out that, instead of one file, there was one file per sheet in the "zip" archive. Wouldn't it be useful to index them all?

comment:16 by Olly Betts, 12 years ago

I don't really know, but currently we index from xl/sharedStrings.xml inside the zip archive which seems to contain the text in the example xlsx file I have. There's no text in xl/worksheets/sheet1.xml in this file.

If you have example files where there's useful text in the files in worksheets, please let us have them. Might be less confusing to open a new ticket and attach them to it.
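
To make the point about xl/sharedStrings.xml concrete, here is a minimal Python sketch (not the omindex code, which is C++) that reads only that part of the archive. The function name `extract_xlsx_strings` is invented for the example, and it assumes a workbook with no string cells simply lacks the shared strings part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_xlsx_strings(data):
    """Return the strings stored in xl/sharedStrings.xml of an .xlsx file.

    `data` is the raw bytes of the workbook (hypothetical helper).
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if "xl/sharedStrings.xml" not in archive.namelist():
            return []  # assumed: no shared strings part means no text cells
        root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    # Each string sits in a <t> element; match on the local name so the
    # default spreadsheet namespace does not matter.
    return [el.text for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "t" and el.text]
```

The per-sheet files under xl/worksheets/ are deliberately not read here: in the examples discussed in this ticket they hold numbers, formulas, and indexes into the shared string table rather than text.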

in reply to:  15 comment:17 by Frank J Bruzzaniti, 12 years ago

Replying to cedric.jeanneret:

In my testing the "strings" are all in xl/sharedStrings.xml; the other xml files contain numbers/formulas, which I didn't think you would want to index.

comment:18 by Cédric Jeanneret, 12 years ago

oh, ok! maybe I was on the wrong track when doing this. Sorry ;)

comment:19 by Olly Betts, 11 years ago

Milestone: 1.1.1 → 1.1.2

comment:20 by Olly Betts, 11 years ago

Description: modified (diff)

Added support for macro-enabled versions in r12892 (I found an example of each on the web and checked the formats were the same).
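
Because the macro-enabled files share the container format, supporting them amounts to mapping extra extensions and MIME types onto the existing filters. The Python sketch below illustrates that idea only; the table names, the `filter_for` function, and the filter labels are hypothetical and do not mirror omindex's actual tables. The MIME type strings are the ones Microsoft publishes for these extensions, and only a subset of the supported extensions is listed.

```python
import os

# Extension -> MIME type (subset; names here are illustrative only).
MIME_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".docm": "application/vnd.ms-word.document.macroEnabled.12",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xlsm": "application/vnd.ms-excel.sheet.macroEnabled.12",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".pptm": "application/vnd.ms-powerpoint.presentation.macroEnabled.12",
}

# MIME type -> filter label: the macro-enabled type reuses the same filter.
FILTER_FOR_MIME = {
    MIME_TYPES[".docx"]: "word",
    MIME_TYPES[".docm"]: "word",
    MIME_TYPES[".xlsx"]: "excel",
    MIME_TYPES[".xlsm"]: "excel",
    MIME_TYPES[".pptx"]: "powerpoint",
    MIME_TYPES[".pptm"]: "powerpoint",
}


def filter_for(filename):
    """Return the filter label for a filename, or None if unsupported."""
    ext = os.path.splitext(filename)[1].lower()
    return FILTER_FOR_MIME.get(MIME_TYPES.get(ext))
```

With this layout, `filter_for("ninja.docm")` and `filter_for("report.docx")` both resolve to the same filter, so no new extraction code is needed for the macro-enabled variants.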

comment:21 by Olly Betts, 11 years ago

Description: modified (diff)
Resolution: fixed
Status: assigned → closed

Added code to extract pptx notesSlides and comments, if present. That means this ticket can be closed.
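
For illustration, a Python sketch of gathering text from slides, notes, and comments in one pass (the real change is the C++ code in r12893). It assumes the usual part names inside the archive (`ppt/slides/slideN.xml`, `ppt/notesSlides/notesSlideN.xml`, `ppt/comments/commentN.xml`), that slide and notes text is in `<a:t>` elements, and that comment text is in `<p:text>` elements. The names `PPTX_PARTS` and `extract_pptx_text` are invented for this example.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Archive members assumed to carry indexable text.
PPTX_PARTS = re.compile(
    r"^ppt/(slides/slide|notesSlides/notesSlide|comments/comment)\d+\.xml$")

# Local element names that hold text: a:t (slides, notes), p:text (comments).
TEXT_ELEMENTS = {"t", "text"}


def extract_pptx_text(data):
    """Return text fragments from slides, notes slides, and comments.

    `data` is the raw bytes of the .pptx file (hypothetical helper).
    Notes and comments parts are optional, so missing ones are skipped.
    """
    fragments = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in sorted(archive.namelist()):
            if not PPTX_PARTS.match(name):
                continue
            root = ET.fromstring(archive.read(name))
            for el in root.iter():
                if el.tag.rsplit("}", 1)[-1] in TEXT_ELEMENTS and el.text:
                    fragments.append(el.text)
    return fragments
```

Selecting parts by name pattern means a presentation without notes or comments needs no special handling: those members are absent from the archive and the loop never sees them.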

While the last two changes could be backported to 1.0.x, I think they are less common and so less important cases, and so I'm not planning to - instead I'd prefer to focus on getting to 1.2.