Opened 10 years ago

Closed 9 years ago

Last modified 9 years ago

#334 closed enhancement (fixed)

Simple .msg filter script

Reported by: frankjb
Owned by: olly
Priority: normal
Milestone: 1.2.4
Component: Omega
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

I tried using msgconvert.pl, but it doesn't write to stdout, so I wrote this script based on it. Not sure if it's useful.

Attachments (4)

msg2txt.pl (1.0 KB) - added by frankjb 10 years ago.
office2007.patch (6.5 KB) - added by frankjb 10 years ago.
This patch has the .msg filter example (also has 2007 and last mod)
outlook_msg.patch (1.1 KB) - added by frankjb 10 years ago.
patch to support indexing of Outlook .msg files
msg2txt-1-2-5.py (1.6 KB) - added by frankjb 10 years ago.
python script to convert Outlook msg to text


Change History (15)

Changed 10 years ago by frankjb

comment:1 Changed 10 years ago by olly

  • Component changed from Other to Omega

It's probably worth adding support, especially as the hard work is done by the Perl module.

Do you have any example files with a suitable licence?

Changed 10 years ago by frankjb

This patch has the .msg filter example (also has 2007 and last mod)

comment:2 Changed 10 years ago by frankjb

I guess you mean an example of its use in omindex? I uploaded a patch that includes the .msg filter.

comment:3 Changed 10 years ago by olly

No, I mean some sample .msg files for testing with (ideally with non-ASCII characters and attachments to check those are handled correctly).

comment:4 Changed 10 years ago by frankjb

Sorry, not that I can get to easily. I can probably get one next week, unless you have one handy.

comment:5 Changed 10 years ago by frankjb

Created a separate patch with just the .msg support, which applies cleanly against 1.0.12. Remember to put msg2txt.pl in your bin directory and make sure you have the required Perl modules installed.

Changed 10 years ago by frankjb

patch to support indexing of Outlook .msg files

comment:6 Changed 10 years ago by olly

Thanks for the "single purpose" patch.

There's no need for an external script here, at least as things stand - we can just inline the trivial perl code required like so:

    string cmd = "perl -MEmail::Outlook::Message -e 'print new Email::Outlook::Message($ARGV[0])->to_email_mime->as_string' " + shell_protect(file) + " | strings";

If things get significantly more complex, then a separate script might be worthwhile.

Some remaining issues:

  • It would still be good to have some sample files for testing, in particular including a few with different character sets and content-encodings, and non-textual attachments.
  • Piping the output through strings is going to be harmful to anything except ASCII text, and if the issue is binary attachments, it will generally return a lot of junk strings from them anyway. Perhaps we need to recurse the Email::MIME subparts to find the text ones?
  • If perl is installed, but the Email::Outlook::Message module isn't, we get exit code 2, which we don't currently treat as meaning "give up trying this filter", so we'll hammer away fruitlessly if given a lot of .msg files. I'd suggest we compare the start of the command with "perl " if we fail with status 2 and handle it as we do status 127.
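The check suggested in the last point could look roughly like this. It is only a sketch, written in Python for illustration (omindex itself is C++), and the function name `filter_unavailable` is hypothetical. The status values are just the ones mentioned above: 127 when the command can't be run, 2 when perl can't load the module.

```python
def filter_unavailable(cmd: str, status: int) -> bool:
    """Return True if the exit status suggests the filter can never work.

    Hypothetical helper: a True result means the caller should stop
    trying this filter for the remaining files of the same type.
    """
    # 127 is the shell's "command not found" status.
    if status == 127:
        return True
    # perl exits with 2 when a -M module can't be loaded, so treat that
    # the same way, but only for commands that actually start with perl.
    if status == 2 and cmd.startswith("perl "):
        return True
    return False
```

With this, a status of 2 from some other filter is still treated as an ordinary per-file failure.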

comment:7 Changed 10 years ago by frankjb

Created a Python script that uses the Perl Email::Outlook::Message module, then recurses the Email::MIME sub-parts, excluding attachments. It needs further testing.

comment:8 Changed 10 years ago by olly

Using both Python and Perl is a novel approach! Isn't it possible to just use one or the other and avoid the overhead of running both?

Also, you probably want to ignore all content types except text/*, not just application/octet-stream.
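For illustration, keeping only the text/* parts could be sketched with Python's standard `email` package as below. This assumes the .msg file has already been converted to a MIME message by some other means, and `text_parts` is a hypothetical name, not something from the attached script.

```python
from email.message import Message
from typing import Iterator


def text_parts(msg: Message) -> Iterator[str]:
    """Yield the decoded body of every text/* leaf part of msg."""
    for part in msg.walk():
        # Multipart containers carry no text of their own.
        if part.is_multipart():
            continue
        # Skip everything that is not text/*, not just octet-stream.
        if part.get_content_maintype() != "text":
            continue
        raw = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            yield raw.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8 (an assumption).
            yield raw.decode("utf-8", errors="replace")
```

Note that this still returns both bodies of a text/plain plus text/html pair, and a text/* attachment would be included too.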

Changed 10 years ago by frankjb

python script to convert Outlook msg to text

comment:9 Changed 10 years ago by frankjb

If sub-part matches text/ it will be converted.

Some emails have the same content in both the text and HTML parts, though, so it gets doubled up. The HTML parts also still contain their HTML tags.

To be honest, I couldn't find a Python library to read Outlook .msg files, and I don't know Perl.

comment:10 Changed 9 years ago by olly

  • Milestone set to 1.2.4
  • Resolution set to fixed
  • Status changed from new to closed

Support for indexing Outlook .msg files committed to trunk in r14962.

I wrote a Perl script to handle this which recurses the MIME subparts of the message and converts text/plain to HTML so the output is all HTML. It handles multipart/alternative in a fairly sane way - it tries each subpart until it gets some text from one - that means text/plain + text/html will be handled by indexing whichever comes first.

Tested on the example .msg files in the Email::Outlook::Message CPAN module sources, though these are rather simple.

comment:11 Changed 9 years ago by olly

I noticed this in comment:6:

If perl is installed, but the Email::Outlook::Message module isn't, we get exit code 2, which we don't currently treat as meaning "give up trying this filter", so we'll hammer away fruitlessly if given a lot of .msg files. I'd suggest we compare the start of the command with "perl " if we fail with status 2 and handle it as we do status 127.

I've fixed this in trunk r14964 by making the Perl script exit with status 127 if the modules aren't found, which seems cleaner than what I suggested above.
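The same convention, sketched in Python for illustration only (the real fix is in the Perl script, and `require_modules` is a hypothetical name): if a required module can't be loaded, exit with 127 so the caller sees it as "filter not available".

```python
import importlib
import sys


def require_modules(names):
    """Exit with status 127 if any of the named modules can't be imported."""
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            print(f"required module {name} not found", file=sys.stderr)
            # 127 matches the shell's "command not found" status, so the
            # caller needs no special case for this filter.
            sys.exit(127)
```

A filter script would call this once at startup, before touching any input files.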
