Ticket #290 (assigned enhancement)

Opened 3 months ago

Last modified 3 months ago

Omega support for Office 2007 Word and Excel Documents

Reported by: frankjb
Owned by: olly
Priority: normal
Milestone: 1.1.0
Component: Omega
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Operating System: All
Blocking:

Description

This patch uses the xmlparser and unzip to extract and process strings from *.xlsx and *.docx files.

P.S. First time I have used svn to create a diff or Trac so forgive me if I've screwed something up :)
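For readers without the attachment to hand, the general approach described above (unzip one XML member, then pull the character data out of it) can be sketched roughly as follows. This is a Python illustration only, not the code in omindex.diff; the helper names `extract_text` and `make_sample` are hypothetical, and the sample document is built in memory rather than taken from a real Word file. The member names `word/document.xml` (for .docx) and `xl/sharedStrings.xml` (for .xlsx) are the usual locations of the main text in these packages.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(data: bytes, member: str) -> str:
    """Return the character data of one XML member of a zip package.

    Hypothetical helper for illustration; not taken from the patch.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read(member))
    # itertext() yields the text of every element in document order.
    # Joining with spaces is a simplification: a word split across
    # two runs would be broken in two.
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def make_sample(member: str, xml: str) -> bytes:
    """Build a tiny in-memory zip holding a single XML member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr(member, xml)  # str is stored as UTF-8
    return buf.getvalue()


# Minimal stand-in for word/document.xml, with one non-ASCII character.
DOCX_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p>'
    "<w:r><w:t>Caf\u00e9</w:t></w:r><w:r><w:t>menu</w:t></w:r>"
    "</w:p></w:body></w:document>"
)

sample = make_sample("word/document.xml", DOCX_XML)
print(extract_text(sample, "word/document.xml"))
```

For a spreadsheet the same call would be pointed at `xl/sharedStrings.xml` instead, since that is where cell strings are normally pooled.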

Attachments

omindex.diff (2.3 kB) - added by frankjb 3 months ago.

Change History

Changed 3 months ago by frankjb

Changed 3 months ago by olly

  • keywords Office 2007 Excel Word removed
  • status changed from new to assigned

Thanks. Could you update the documentation to match? And ideally provide some sample files which are redistributable and contain some non-ASCII characters? For more information, see:

http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat

(Also, as the bug report form suggests, please don't set "keywords" when reporting a bug - its purpose isn't what everyone naturally seems to think it is! One of these days I'll work out how to get trac to hide it...)

Changed 3 months ago by olly

  • milestone changed from 1.0.8 to 1.1.0

I've taken a closer look at the patch. It looks good apart from the lack of documentation updates and example files. I'm afraid I don't have time right now to update the documentation or track down suitable examples myself, so I'm moving the milestone to 1.1.0.

While checking that the content-types used were appropriate (oddly they aren't listed by IANA, but they are mentioned in posts on blogs.msdn.com so I guess they're OK), I found there are some other formats from which we can probably extract text in the same way.

If you can comment on any of the following, that would help. Otherwise I'll research as time allows.

http://blogs.msdn.com/dmahugh/archive/2006/08/08/692600.aspx lists more extensions and content types:

* Are the weirdly-named "macroEnabled.12" variants compatible formats?

* We handle .dot files, so should handle .dotx too if possible.

* We should handle .ppsx and .pptx if the same approach works (and .ppsm and .pptm if they have the same format).

* Ditto .xps.

http://blogs.msdn.com/ericwhite/pages/the-openxmldocument-class.aspx also mentions "drawings".

Changed 3 months ago by olly

http://www.lesbonscomptes.com/recoll/filters/rclopxml seems to show the paths to look for in a slideshow or presentation.
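As a rough illustration of what "paths to look for" means for a presentation, the sketch below picks the slide parts out of a package's member list. It assumes the conventional layout in which slides are stored as `ppt/slides/slideN.xml`; it is not taken from the linked rclopxml filter, and `slide_members` is a hypothetical helper name.

```python
import re

# Conventional location of slide parts inside a .pptx package
# (an assumption for this sketch, not copied from rclopxml).
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def slide_members(names):
    """Return the slide part names from a zip member list, by slide number."""
    found = []
    for name in names:
        m = SLIDE_RE.match(name)
        if m:
            # Sort numerically so slide10 comes after slide2.
            found.append((int(m.group(1)), name))
    return [name for _, name in sorted(found)]


members = [
    "ppt/presentation.xml",
    "ppt/slides/slide10.xml",
    "ppt/slides/_rels/slide2.xml.rels",
    "ppt/slides/slide2.xml",
]
print(slide_members(members))
```

In practice the list would come from `zipfile.ZipFile(...).namelist()`. Note that the file number is not guaranteed to match the order in which slides are shown, which probably doesn't matter for indexing.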
