Opened 15 years ago

Closed 14 years ago

Last modified 14 years ago

#334 closed enhancement (fixed)

Simple .msg filter script

Reported by: Frank J Bruzzaniti
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.4
Component: Omega
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

I tried using msgconvert.pl, but it doesn't write to stdout, so I wrote this script based on it. I'm not sure whether it's useful.

Attachments (4)

msg2txt.pl (1.0 KB ) - added by Frank J Bruzzaniti 15 years ago.
office2007.patch (6.5 KB ) - added by Frank J Bruzzaniti 15 years ago.
This patch has the .msg filter example (also has 2007 and last mod)
outlook_msg.patch (1.1 KB ) - added by Frank J Bruzzaniti 15 years ago.
patch to support indexing of Outlook .msg files
msg2txt-1-2-5.py (1.6 KB ) - added by Frank J Bruzzaniti 15 years ago.
python script to convert Outlook msg to text

Change History (15)

by Frank J Bruzzaniti, 15 years ago

Attachment: msg2txt.pl added

comment:1 by Olly Betts, 15 years ago

Component: Other → Omega

It's probably worth adding support, especially as the hard work is done by the Perl module.

Do you have any example files with a suitable licence?

by Frank J Bruzzaniti, 15 years ago

Attachment: office2007.patch added

This patch has the .msg filter example (also has 2007 and last mod)

comment:2 by Frank J Bruzzaniti, 15 years ago

I guess you mean an example of its use in omindex? I've uploaded a patch that includes the .msg filter.

comment:3 by Olly Betts, 15 years ago

No, I mean some sample .msg files for testing with (ideally with non-ASCII characters and attachments to check those are handled correctly).

comment:4 by Frank J Bruzzaniti, 15 years ago

Sorry, not that I can get to easily. I can probably get one next week, unless you have one handy.

comment:5 by Frank J Bruzzaniti, 15 years ago

Created a separate patch with just .msg support, which applied cleanly against 1.0.12. Remember to put msg2txt.pl in your bin directory and make sure you have the required Perl modules installed.

by Frank J Bruzzaniti, 15 years ago

Attachment: outlook_msg.patch added

patch to support indexing of Outlook .msg files

comment:6 by Olly Betts, 15 years ago

Thanks for the "single purpose" patch.

There's no need for an external script here, at least as things stand - we can just inline the trivial Perl code required like so:

    string cmd = "perl -MEmail::Outlook::Message -e 'print new Email::Outlook::Message($ARGV[0])->to_email_mime->as_string' " + shell_protect(file) + " | strings";

If things get significantly more complex, then a separate script might be worthwhile.

Some remaining issues:

  • It would still be good to have some sample files for testing, in particular including a few with different character sets and content-encodings, and non-textual attachments.
  • Piping the output through strings is going to be harmful to anything except ASCII text, and if the issue is binary attachments, it will generally return a lot of junk strings from them anyway. Perhaps we need to recurse the Email::MIME subparts to find the text ones?
  • If perl is installed, but the Email::Outlook::Message module isn't, we get exit code 2, which we don't currently treat as meaning "give up trying this filter", so we'll hammer away fruitlessly if given a lot of .msg files. I'd suggest we compare the start of the command with "perl " if we fail with status 2 and handle it as we do status 127.
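The third point could look roughly like this. It is a Python sketch of the decision logic only, not omindex's actual C++ code; `filter_unavailable` is a hypothetical name, and status 2 is assumed to mean a missing Perl module, as described above.

```python
def filter_unavailable(cmd: str, status: int) -> bool:
    """Return True if a failed filter command is not worth retrying.

    Hypothetical helper for illustration only.
    """
    # 127 is the status a POSIX shell reports when the command is not found.
    if status == 127:
        return True
    # Suggested extra rule: a perl command failing with status 2 is assumed
    # to mean a required module (e.g. Email::Outlook::Message) is missing.
    return status == 2 and cmd.startswith("perl ")
```

A caller would remember a True result for the file type and skip the filter for the remaining .msg files instead of running it again for each one.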

comment:7 by Frank J Bruzzaniti, 15 years ago

Created a Python script that uses the Perl Email::Outlook::Message module and then recurses the Email::MIME sub-parts, excluding attachments. Needs further testing.

comment:8 by Olly Betts, 15 years ago

Using both Python and Perl is a novel approach! Isn't it possible to just use one or the other and avoid the overhead of running both?

Also, you probably want to ignore all content types except text/*, not just application/octet-stream.
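Restricting the output to text/* parts might look like this with Python's standard `email` package. This is only a sketch, not the attached script; `text_parts` and `SAMPLE` are made up for illustration, and the input is assumed to be a MIME message that has already been converted from .msg.

```python
from email import message_from_string
from email.message import Message

# Hypothetical two-part message: one text part, one binary attachment.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

Hello
--XYZ
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64

AAEC
--XYZ--
"""


def text_parts(msg: Message):
    """Yield (content type, decoded body) for every text/* leaf part."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no text of their own
        if part.get_content_maintype() != "text":
            continue  # skip everything that is not text/*
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            body = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name
            body = payload.decode("utf-8", errors="replace")
        yield part.get_content_type(), body


for ctype, body in text_parts(message_from_string(SAMPLE)):
    print(ctype, repr(body))
```

Testing the main type rather than listing excluded types means images, archives, and other attachments are all skipped without having to name each one.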

by Frank J Bruzzaniti, 15 years ago

Attachment: msg2txt-1-2-5.py added

python script to convert Outlook msg to text

comment:9 by Frank J Bruzzaniti, 15 years ago

If a sub-part matches text/ it will be converted.

However, some emails have the same content in text as they do in HTML, so you get it twice. Also, the HTML output still contains the HTML tags.

To be honest, I couldn't find a Python library to read Outlook .msg files, and I don't know Perl.

comment:10 by Olly Betts, 14 years ago

Milestone: 1.2.4
Resolution: fixed
Status: new → closed

Support for indexing Outlook .msg files committed to trunk in r14962.

I wrote a Perl script to handle this which recurses the MIME subparts of the message and converts text/plain to HTML so the output is all HTML. It handles multipart/alternative in a fairly sane way - it tries each subpart until it gets some text from one - that means text/plain + text/html will be handled by indexing whichever comes first.
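The handling described above could be sketched like this in Python. The committed script is Perl and this is not a translation of it: `part_to_html` and `SAMPLE` are hypothetical, wrapping text/plain in `<pre>` is an assumption about how the conversion to HTML might be done, and text/* types other than plain and html are simply skipped here.

```python
import html
from email import message_from_string
from email.message import Message

# Hypothetical message carrying the same content as plain text and as HTML.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="ALT"

--ALT
Content-Type: text/plain; charset="us-ascii"

a < b
--ALT
Content-Type: text/html; charset="us-ascii"

<p>a &lt; b</p>
--ALT--
"""


def _decode(part: Message) -> str:
    """Decode a leaf part's body to str, tolerating bad charset names."""
    payload = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # unknown charset name
        return payload.decode("utf-8", errors="replace")


def part_to_html(part: Message) -> str:
    """Recursively turn a MIME part into HTML, or '' if it has no text."""
    ctype = part.get_content_type()
    if part.is_multipart():
        pieces = []
        for sub in part.get_payload():
            out = part_to_html(sub)
            if not out:
                continue
            if ctype == "multipart/alternative":
                return out  # first alternative that yields text wins
            pieces.append(out)
        return "\n".join(pieces)
    body = _decode(part)
    if not body.strip():
        return ""
    if ctype == "text/plain":
        # Assumed conversion: escape markup characters and keep layout.
        return "<pre>" + html.escape(body) + "</pre>"
    if ctype == "text/html":
        return body
    return ""  # attachments and other types are ignored


print(part_to_html(message_from_string(SAMPLE)))
```

Because the plain-text alternative comes first in `SAMPLE`, only that part is used, which avoids indexing the same content twice.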

Tested on the example .msg files in the Email::Outlook::Message CPAN module sources, though these are rather simple.

comment:11 by Olly Betts, 14 years ago

I noticed this in comment:6:

> If perl is installed, but the Email::Outlook::Message module isn't, we get exit code 2, which we don't currently treat as meaning "give up trying this filter", so we'll hammer away fruitlessly if given a lot of .msg files. I'd suggest we compare the start of the command with "perl " if we fail with status 2 and handle it as we do status 127.

I've fixed this in trunk r14964 by making the Perl script exit with status 127 if the modules aren't found, which seems cleaner than what I suggested above.
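The same idea, sketched as a Python analogue rather than the Perl that was actually committed: the filter script checks its own dependencies up front and exits with 127 if any are missing, so the caller needs no special case. `require_modules` is a hypothetical helper name.

```python
import importlib
import sys


def require_modules(names):
    """Exit with status 127 if any of the named modules cannot be imported."""
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            print(f"required module not found: {name}", file=sys.stderr)
            # 127 matches the shell's "command not found" status, so the
            # caller can treat a missing module like a missing filter.
            sys.exit(127)


# The stdlib "email" package stands in for a real dependency here.
require_modules(["email"])
```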
