Opened 11 years ago

Closed 10 years ago

#290 closed enhancement (fixed)

Omega support for Office 2007 Documents

Reported by: frankjb
Owned by: olly
Priority: normal
Milestone: 1.1.2
Component: Omega
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by olly)

Support for Word, Excel and PowerPoint committed to trunk as r11866; XPS as r11904; macro-enabled versions in r12892; extraction of pptx notes and comments (if present) in r12893.


Original description:

This patch uses the xmlparser and unzip to extract and process strings from *.xlsx and *.docx files.

P.S. First time I have used svn to create a diff or Trac so forgive me if I've screwed something up :)
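
The approach in the original description (unzip the container, then pull strings out of the XML) can be sketched roughly as follows. This is an illustration in Python, not the patch itself, which is C++ inside omindex. The part name word/document.xml and the WordprocessingML namespace come from the Office Open XML format; the helper names docx_text and make_sample_docx are hypothetical, and the sample archive holds only the one part being read, so it is not a complete .docx file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return the text of word/document.xml, one line per paragraph.

    `source` may be a filename or a file-like object.
    Simplification: nested paragraphs (e.g. text boxes) are not special-cased.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        runs = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx():
    """Build a tiny in-memory archive with only the part read above."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>検索</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)  # str is stored as UTF-8
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(docx_text(make_sample_docx()))
```

The second sample paragraph is non-ASCII on purpose, since the review below asks for test files containing non-ASCII characters.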

Attachments (7)

omindex.diff (2.3 KB) - added by frankjb 11 years ago.
ms2007.patch (3.0 KB) - added by frankjb 11 years ago.
Here's a new patch that includes support for PowerPoint (.pptx)
overview.rst (18.8 KB) - added by frankjb 11 years ago.
I've updated the doco, not sure if it was meant to be a diff. If so let me know
office2007.patch (6.5 KB) - added by frankjb 11 years ago.
Patch adds support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx
overview.rst.patch (994 bytes) - added by frankjb 11 years ago.
Updated doco for .docx .dotx .xlsx .xltx .pptx .potx .ppsx
2007_Test_Docs.tar.gz (186.7 KB) - added by frankjb 11 years ago.
Test files to test filter
OpenSearchをサポートする検索エンジンとソフトウェア.xps (138.9 KB) - added by frankjb 11 years ago.
XPS test file


Change History (28)

Changed 11 years ago by frankjb

comment:1 Changed 11 years ago by olly

  • Keywords Office 2007 Excel Word removed
  • Status changed from new to assigned

Thanks. Could you update the documentation to match? And ideally provide some sample files which are redistributable and contain some non-ASCII characters? For more information, see:

http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat

(Also, as the bug report form suggests, please don't set "keywords" when reporting a bug - its purpose isn't what everyone naturally seems to think it is! One of these days I'll work out how to get trac to hide it...)

comment:2 follow-up: Changed 11 years ago by olly

  • Milestone changed from 1.0.8 to 1.1.0

I've taken a closer look at the patch. It looks good apart from the lack of documentation updates and example files. I'm afraid I don't currently have the time to update the documentation or track down suitable examples myself right now, so I'm moving the milestone to 1.1.0.

While checking the content-types used were appropriate (oddly they aren't listed by IANA, but they are mentioned in posts of blogs.msdn.com so I guess they're OK) I found there are some other formats from which we can probably extract text in the same way.

If you can comment on any of the following, that would help. Otherwise I'll research as time allows.

http://blogs.msdn.com/dmahugh/archive/2006/08/08/692600.aspx lists more extensions and content types:

  • Are the weirdly-named "macroEnabled.12" variants compatible formats?
  • We handle .dot files, so should handle .dotx too if possible.
  • We should handle .ppsx and .pptx if the same approach works (and .ppsm and .pptm if they have the same format).
  • Ditto .xps.

http://blogs.msdn.com/ericwhite/pages/the-openxmldocument-class.aspx also mentions "drawings".

comment:3 Changed 11 years ago by olly

http://www.lesbonscomptes.com/recoll/filters/rclopxml seems to show the paths to look for in a slideshow or presentation.

comment:4 Changed 11 years ago by olly

  • Milestone changed from 1.1.0 to 1.1.1

Bumping milestone to 1.1.1 as this is ready to apply and isn't an incompatible change.

Changed 11 years ago by frankjb

Here's a new patch that includes support for PowerPoint (.pptx)

Changed 11 years ago by frankjb

I've updated the doco, not sure if it was meant to be a diff. If so let me know

comment:5 in reply to: ↑ 2 Changed 11 years ago by frankjb

Replying to olly:

XPS format is do-able, it's very similar.

Office 2007 mimetypes are here: http://blogs.msdn.com/vsofficedeveloper/pages/Office-2007-Open-XML-MIME-Types.aspx

I can't find any mention of a "macroEnabled.12" variant for any of the "openxmlformats" types such as docx. I'll see if I can filter this lot: .docx .dotx .xlsx .xltx .pptx .potx .ppsx

BTW, for test documents, would some text in Japanese and English work?

comment:6 Changed 11 years ago by frankjb

I've created a patch which contains support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx. It also has the msg and last_mod changes patched into it. Do I need to separate those into another patch?

I've also uploaded a patch file for overview.rst and I've uploaded some 2007 test docs.

Changed 11 years ago by frankjb

Patch adds support for .docx .dotx .xlsx .xltx .pptx .potx .ppsx

Changed 11 years ago by frankjb

Updated doco for .docx .dotx .xlsx .xltx .pptx .potx .ppsx

Changed 11 years ago by frankjb

Test files to test filter

comment:7 Changed 11 years ago by olly

It seems that "docm", etc are Office 2007:

http://filext.com/file-extension/DOCM

And googling for filetype:docm finds a number of files - I checked one and it is the xml format:

http://www.dangerofchi.org/ninja.docm

So it looks to me like we should add these to the list of understood extensions. Any reason not to?

A diff for the docs is better - thanks for that.

I'll look at applying what's here now.

comment:8 Changed 11 years ago by frankjb

I found a few .docm files on the net and tested them; they are put together the same way as .docx files, so we could use the same filter by adding another mimetype. But do you think we need to index the macros as well?

comment:9 Changed 11 years ago by olly

I don't think indexing the macros themselves is useful, but my understanding is that these aren't just files full of macros, but macro-enabled documents - i.e. documents with macros which the file extension says it is OK to run (quite a scary concept that the extension or mime-type should be trusted to say that, but that seems to be what these are).

But perhaps I've misunderstood - I didn't find any actual MS documentation of these aside from the lists of mime type mappings. I haven't added support for them for now, but it is easy to do so.

I've merged the other Office 2007 changes (and refactored a little to reduce the code repetition). You'd missed updating the list of mime-types in the docs, but I've addressed that.

And the sample files seem to index correctly, so it all looks good.

So I'm ready to commit, but what copyright notice should I add for your changes? (I don't know if the copyright is owned by you, or an employer or similar).

comment:10 Changed 11 years ago by frankjb

I'm unemployed atm so you can copyright Frank J Bruzzaniti

comment:11 Changed 11 years ago by olly

  • Description modified (diff)
  • Summary changed from Omega support for Office 2007 Word and Excel Documents to Omega support for Office 2007 Documents

OK, applied, and updated the description to list the remaining issues.

comment:12 Changed 11 years ago by olly

  • Description modified (diff)

I've written a quick XpsXmlParser class and wired it up to handle xps files. Seems to work for the example attached here. Committed to trunk as r11904.

comment:13 Changed 11 years ago by olly

  • Description modified (diff)

comment:14 Changed 11 years ago by olly

Backported Office 2007 (r12096) and XPS (r12097) support for 1.0.11.

comment:15 follow-up: Changed 11 years ago by cedric.jeanneret

Hi,

Just wondering: aren't xlsx spreadsheets a collection of XML files? I tried to index this kind of file with swish-e some months ago and found that, instead of one file, there was one file per sheet in the "zip" archive. Wouldn't it be useful to index them all?

comment:16 Changed 11 years ago by olly

I don't really know, but currently we index from xl/sharedStrings.xml inside the zip archive which seems to contain the text in the example xlsx file I have. There's no text in xl/worksheets/sheet1.xml in this file.

If you have example files where there's useful text in the files in worksheets, please let us have them. Might be less confusing to open a new ticket and attach them to it.
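
As a rough illustration of why reading only xl/sharedStrings.xml can be enough, here is a Python sketch (not Omega's C++ code) that collects the shared string table from an xlsx container. The part name comes from the comment above and the SpreadsheetML namespace from the file format; the helper names xlsx_shared_strings and make_sample_xlsx are hypothetical, and the sample archive contains only two parts rather than a complete workbook.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by the shared string table and worksheets.
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_shared_strings(source):
    """Return the strings stored in xl/sharedStrings.xml, in table order."""
    with zipfile.ZipFile(source) as archive:
        if "xl/sharedStrings.xml" not in archive.namelist():
            return []  # a workbook holding only numbers may lack this part
        root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    strings = []
    for item in root.iter(S_NS + "si"):
        # An entry is either a single <t> or several rich-text runs with <t>.
        strings.append("".join(t.text or "" for t in item.iter(S_NS + "t")))
    return strings


def make_sample_xlsx():
    """Build an in-memory archive with a string table and one worksheet."""
    ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    shared = (
        '<sst xmlns="%s" count="2" uniqueCount="2">'
        "<si><t>Name</t></si>"
        "<si><r><t>Rich </t></r><r><t>text</t></r></si>"
        "</sst>" % ns
    )
    # The worksheet refers to strings by index (t="s"), so it holds no text.
    sheet = (
        '<worksheet xmlns="%s"><sheetData><row r="1">'
        '<c r="A1" t="s"><v>0</v></c><c r="B1"><v>42</v></c>'
        "</row></sheetData></worksheet>" % ns
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("xl/sharedStrings.xml", shared)
        archive.writestr("xl/worksheets/sheet1.xml", sheet)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(xlsx_shared_strings(make_sample_xlsx()))
```

In this sample the worksheet cell only stores the index 0 into the string table, which matches the observation that the worksheet file itself contains no text.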

comment:17 in reply to: ↑ 15 Changed 11 years ago by frankjb

Replying to cedric.jeanneret:

In my testing the "strings" are all in xl/sharedStrings.xml; the other xml files contain numbers/formulas, which I didn't think you would want to index.

comment:18 Changed 11 years ago by cedric.jeanneret

Oh, OK! Maybe I was on the wrong track when doing this. Sorry ;)

comment:19 Changed 10 years ago by olly

  • Milestone changed from 1.1.1 to 1.1.2

comment:20 Changed 10 years ago by olly

  • Description modified (diff)

Added support for macro-enabled versions in r12892 (I found an example of each on the web and checked the formats were the same).
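
Since the macro-enabled files use the same container layout, supporting them mostly comes down to mapping more extensions and MIME types onto the existing extraction paths. The Python sketch below shows that idea only; it is not Omega's actual table. The MIME type strings are the ones Microsoft publishes for Office 2007, while the grouping into "word", "excel" and "powerpoint" and the function names are assumptions made for illustration.

```python
import os

# Extension -> MIME type for the Office 2007 formats discussed in this ticket.
OFFICE_2007_MIME_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".dotx": "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xltx": "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".potx": "application/vnd.openxmlformats-officedocument.presentationml.template",
    ".ppsx": "application/vnd.openxmlformats-officedocument.presentationml.slideshow",
    # Macro-enabled variants: same zip + XML layout, different MIME type.
    ".docm": "application/vnd.ms-word.document.macroEnabled.12",
    ".dotm": "application/vnd.ms-word.template.macroEnabled.12",
    ".xlsm": "application/vnd.ms-excel.sheet.macroEnabled.12",
    ".xltm": "application/vnd.ms-excel.template.macroEnabled.12",
    ".pptm": "application/vnd.ms-powerpoint.presentation.macroEnabled.12",
    ".potm": "application/vnd.ms-powerpoint.template.macroEnabled.12",
    ".ppsm": "application/vnd.ms-powerpoint.slideshow.macroEnabled.12",
}


def mime_type_for(filename):
    """Look up the MIME type from the (case-insensitive) file extension."""
    ext = os.path.splitext(filename)[1].lower()
    return OFFICE_2007_MIME_TYPES.get(ext)


def filter_family(mime_type):
    """Hypothetical grouping: which extraction path would handle this type."""
    if mime_type is None:
        return None
    if "wordprocessingml" in mime_type or "ms-word" in mime_type:
        return "word"
    if "spreadsheetml" in mime_type or "ms-excel" in mime_type:
        return "excel"
    if "presentationml" in mime_type or "ms-powerpoint" in mime_type:
        return "powerpoint"
    return None


if __name__ == "__main__":
    for name in ("report.docx", "report.DOCM", "budget.xlsm", "notes.txt"):
        print(name, filter_family(mime_type_for(name)))
```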

comment:21 Changed 10 years ago by olly

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from assigned to closed

Added code to extract pptx notesSlides and comments, if present. That means this ticket can be closed.

While the last two changes could be backported to 1.0.x, I think they are less common and so less important cases, and so I'm not planning to - instead I'd prefer to focus on getting to 1.2.
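
For readers unfamiliar with the pptx layout, the notes and comments mentioned above live in their own parts next to the slides. The following Python sketch is an illustration, not the committed C++ code. It gathers text from slide, notesSlide and comment parts; the part paths and the a:t / p:text element names follow the Office Open XML format as I understand it, and the helper names pptx_text and make_sample_pptx are hypothetical.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
P_NS = "{http://schemas.openxmlformats.org/presentationml/2006/main}"

# Parts that may carry text: slides, speaker notes, and review comments.
PART_RE = re.compile(
    r"ppt/(slides/slide|notesSlides/notesSlide|comments/comment)\d+\.xml$"
)


def pptx_text(source):
    """Return text chunks from slides, notes and comments, if present."""
    chunks = []
    with zipfile.ZipFile(source) as archive:
        # Lexicographic order, so comments sort before notes before slides.
        for name in sorted(archive.namelist()):
            if not PART_RE.match(name):
                continue
            root = ET.fromstring(archive.read(name))
            for elem in root.iter():
                # Slide and notes text sits in <a:t>, comment text in <p:text>.
                if elem.tag in (A_NS + "t", P_NS + "text") and elem.text:
                    chunks.append(elem.text)
    return chunks


def make_sample_pptx():
    """Build an in-memory archive holding only the parts read above."""
    a = "http://schemas.openxmlformats.org/drawingml/2006/main"
    p = "http://schemas.openxmlformats.org/presentationml/2006/main"
    body = '<p:%s xmlns:p="%s" xmlns:a="%s"><a:p><a:r><a:t>%s</a:t></a:r></a:p></p:%s>'
    slide = body % ("sld", p, a, "Title", "sld")
    notes = body % ("notes", p, a, "Speaker note", "notes")
    comment = (
        '<p:cmLst xmlns:p="%s"><p:cm><p:text>Check this</p:text></p:cm></p:cmLst>' % p
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("ppt/presentation.xml", '<p:presentation xmlns:p="%s"/>' % p)
        archive.writestr("ppt/slides/slide1.xml", slide)
        archive.writestr("ppt/notesSlides/notesSlide1.xml", notes)
        archive.writestr("ppt/comments/comment1.xml", comment)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(pptx_text(make_sample_pptx()))
```

A presentation without notes or comments simply has no matching parts, so the same function covers the "if present" case without extra checks.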
