#334 closed enhancement (fixed)
Simple .msg filter script
Reported by: | Frank J Bruzzaniti | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.4 |
Component: | Omega | Version: | |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
I tried using msgconvert.pl, but it doesn't output to stdout, so I wrote this script based on it. Not sure if it's useful.
Attachments (4)
Change History (15)
by , 16 years ago
Attachment: | msg2txt.pl added |
---|
comment:1 by , 16 years ago
Component: | Other → Omega |
---|
by , 16 years ago
Attachment: | office2007.patch added |
---|
This patch has the .msg filter example (also has 2007 and last mod)
comment:2 by , 16 years ago
I guess you mean as in use in omindex? I uploaded a patch that has the .msg filter in it.
comment:3 by , 16 years ago
No, I mean some sample .msg files for testing with (ideally with non-ASCII characters and attachments to check those are handled correctly).
comment:4 by , 16 years ago
Sorry, not that I can get to easily. I can probably get one next week, unless you have one handy.
comment:5 by , 16 years ago
Created a separate patch with just .msg support, which applied cleanly against 1.0.12. Remember to put msg2txt.pl in your bin directory and check that you have the correct Perl modules installed.
by , 16 years ago
Attachment: | outlook_msg.patch added |
---|
patch to support indexing of Outlook .msg files
comment:6 by , 16 years ago
Thanks for the "single purpose" patch.
There's no need for an external script here, at least as things stand - we can just inline the trivial perl code required like so:
string cmd = "perl -MEmail::Outlook::Message -e 'print new Email::Outlook::Message($ARGV[0])->to_email_mime->as_string' " + shell_protect(file) + " | strings";
If things get significantly more complex, then a separate script might be worthwhile.
Some remaining issues:
- It would still be good to have some sample files for testing, in particular including a few with different character sets and content-encodings, and non-textual attachments.
- Piping the output through strings is going to be harmful to anything except ASCII text, and if the issue is binary attachments, it will generally return a lot of junk strings from them anyway. Perhaps we need to recurse the Email::MIME subparts to find the text ones?
- If perl is installed, but the Email::Outlook::Message module isn't, we get exit code 2, which we don't currently treat as meaning "give up trying this filter", so we'll hammer away fruitlessly if given a lot of .msg files. I'd suggest we compare the start of the command with "perl " if we fail with status 2 and handle it as we do status 127.
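To illustrate the "recurse the subparts to find the text ones" idea, here is a minimal sketch using Python's standard `email` package rather than Email::MIME. It is not the code from any patch on this ticket; the helper name `text_parts` and the sample message are made up for the example. It assumes the .msg file has already been converted to a MIME message.

```python
from email.message import EmailMessage


def text_parts(msg):
    """Yield the decoded body of every text/* leaf part, skipping the rest."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own
        if part.get_content_maintype() != "text":
            continue  # e.g. images or application/octet-stream attachments
        # get_content() decodes the transfer encoding and the charset,
        # so non-ASCII text survives (unlike piping through strings).
        yield part.get_content()


# Hypothetical sample: a non-ASCII body plus a binary attachment.
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("Grüße aus Köln")
msg.add_attachment(b"\x00\x01\x02binary", maintype="application",
                   subtype="octet-stream", filename="blob.bin")

for body in text_parts(msg):
    print(body.strip())
```

Only the text/plain body is yielded here; the binary attachment is never read, so no junk strings reach the indexer.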
comment:7 by , 15 years ago
Created a Python script that uses the Perl Email::Outlook::Message module, then recurses the Email::MIME sub-parts, excluding attachments. Needs further testing.
comment:8 by , 15 years ago
Using both Python and Perl is a novel approach! Isn't it possible to just use one or the other and avoid the overhead of running both?
Also, you probably want to ignore all content types except text/*, not just application/octet-stream.
comment:9 by , 15 years ago
If sub-part matches text/ it will be converted.
Although some emails have the same content in text as they do in HTML, so you double up. Also, the HTML part still contains HTML tags.
To be honest I couldn't find a python lib to read outlook .msg files and I don't know perl.
comment:10 by , 14 years ago
Milestone: | → 1.2.4 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Support for indexing Outlook .msg files committed to trunk in r14962.
I wrote a Perl script to handle this which recurses the MIME subparts of the message and converts text/plain to HTML so the output is all HTML. It handles multipart/alternative in a fairly sane way - it tries each subpart until it gets some text from one - that means text/plain + text/html will be handled by indexing whichever comes first.
Tested on the example .msg files in the Email::Outlook::Message CPAN module sources, though these are rather simple.
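The multipart/alternative strategy described in this comment can be sketched roughly as follows. This is a Python illustration using the standard `email` package, not the Perl script committed in r14962; the function name `part_to_html` and the choice to wrap escaped plain text in `<pre>` are assumptions made for the example.

```python
import html
from email.message import EmailMessage


def part_to_html(part):
    """Return HTML for a MIME part, or "" if it holds no indexable text."""
    ctype = part.get_content_type()
    if part.is_multipart():
        pieces = (part_to_html(sub) for sub in part.iter_parts())
        if ctype == "multipart/alternative":
            # Alternatives duplicate each other: keep the first that yields text.
            return next((p for p in pieces if p), "")
        # Other multipart types: concatenate whatever each subpart yields.
        return "".join(pieces)
    if ctype == "text/html":
        return part.get_content()
    if ctype == "text/plain":
        # Escape and wrap so the overall output is uniformly HTML (assumed markup).
        return "<pre>" + html.escape(part.get_content()) + "</pre>"
    return ""  # non-text attachment: ignore


# Hypothetical sample: the same content as text/plain and text/html.
msg = EmailMessage()
msg.set_content("1 < 2")
msg.add_alternative("<p>1 &lt; 2</p>", subtype="html")
print(part_to_html(msg))
```

Because the text/plain alternative comes first in this sample, it is the one converted and returned, and the text/html alternative is never indexed, which avoids the doubling up mentioned in comment:9.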
comment:11 by , 14 years ago
I noticed this in comment:6:
If perl is installed, but the Email::Outlook::Message module isn't, we get exit code 2, which we don't currently treat as meaning "give up trying this filter", so we'll hammer away fruitlessly if given a lot of .msg files. I'd suggest we compare the start of the command with "perl " if we fail with status 2 and handle it as we do status 127.
I've fixed this in trunk r14964 by making the Perl script exit with status 127 if the modules aren't found, which seems cleaner than what I suggested above.
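The idea of signalling a missing dependency with status 127 can be sketched as below. This is a Python illustration, not the Perl script or the omindex code; the module name in `FILTER_SRC` and the helper `filter_available` are hypothetical. It relies on the shell convention that 127 means "command not found", which is the status the comment says is already handled.

```python
import subprocess
import sys

# Hypothetical filter script: exit with 127 if a required module is missing,
# so the caller treats it like a filter that is not installed.
FILTER_SRC = """
import sys
try:
    import hypothetical_missing_module_for_demo  # stand-in for the real dependency
except ImportError:
    sys.exit(127)
"""


def filter_available(argv):
    """Run a filter command; return False if it exits with status 127."""
    result = subprocess.run(argv, capture_output=True)
    return result.returncode != 127


ok = filter_available([sys.executable, "-c", FILTER_SRC])
print("filter usable:", ok)
```

A caller could check this once and then stop trying the filter for the remaining .msg files, instead of re-running a command that can never succeed.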
It's probably worth adding support, especially as the hard work is done by the Perl module.
Do you have any example files with a suitable licence?