Parsing emails

I am writing code to parse forwarded emails. I'm not sure if there is a Python library, some RFCs that I could stick to, or some other resource that would allow me to automate the task.

To be precise, I don't know whether the "layout" of forwarded emails is covered by some standard or recommendation, or whether it simply evolved over the years so that most email clients now produce similar output for the text part:

Begin forwarded message:

> From: Me <me@me.me>
> Date: January 30, 2010 18:26:33 PM GMT+02:00
> To: Other Me <other-me@me.me>
> Subject: Unwise question

- and the attachments follow (along with any other MIME parts that may be present).

If this is not precise enough, I'll clarify it; I'm just not 100% sure what to ask for (an RFC, a Python library, a convention or something else).

4 answers

In my experience, almost every email client forwards and replies differently. Usually you will have a plain-text version and an HTML version in the MIME parts at the bottom of the message. The message headers are covered by an RFC (RFC 2822, http://www.faqs.org/rfcs/rfc2822.html), but unfortunately the content of the message body is out of its scope.

You have to contend not only with differing email clients but also with differing user preferences. For example, Lotus Notes places replies at the top and Thunderbird places them at the bottom. So when a Thunderbird user replies to a Lotus Notes user's reply, they may insert their answer at the top and leave their signature at the bottom.

Another pitfall is word wrapping in reply chains.

>> An outer reply that runs past the line limit and is word-wrapped by the middle replier's mail client\n
> Message body of the middle reply
> Previous reply
Newest reply

I would not parse the message body at all and would leave it to the reader to parse in their head. Failing that, I would borrow code from another project.


Contrary to what many other people say, there is a standard for forwarded emails: RFC 2046, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", which is more than ten years old. See in particular section 5.2, "Message Media Type".

The basic idea of RFC 2046 is to encapsulate one message in a MIME part of another, using the (unfortunately named) type message/rfc822 (never forget that MIME is recursive). Python's MIME library can handle this.
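As a minimal sketch of that idea, the standard-library `email` package can walk a parsed message and pull out every part of type message/rfc822. The sample message and the helper name `forwarded_messages` are made up for illustration; real mail will be messier.

```python
from email import message_from_string, policy

# Made-up sample: a multipart message that carries a forwarded message
# as a message/rfc822 part.
RAW = """\
From: Me <me@me.me>
To: Other Me <other-me@me.me>
Subject: Fwd: Unwise question
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain

See the forwarded message.
--XX
Content-Type: message/rfc822

From: Me <me@me.me>
Subject: Unwise question

Original body.
--XX--
"""


def forwarded_messages(msg):
    """Yield every message encapsulated as a message/rfc822 part, at any depth."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the encapsulated message object.
            yield part.get_payload(0)


outer = message_from_string(RAW, policy=policy.default)
inner = list(forwarded_messages(outer))
for fwd in inner:
    print(fwd["From"], "|", fwd["Subject"])
```

Because `walk()` also descends into the encapsulated message, a forward of a forward is found the same way.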

I did not downvote the other answers because they are right in one respect: not all mail programs follow the standard. For example, the mutt mailer can forward a message in RFC 2046 format, but also in an ad hoc format. So in practice a program probably cannot handle RFC 2046 alone; it also needs to parse various other, underspecified syntaxes.
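For the ad hoc case, one possible approach is a small heuristic parser. The sketch below handles only one layout, the "Begin forwarded message:" marker followed by quoted pseudo-headers as shown in the question; the function name and both regular expressions are assumptions, and other clients will need their own patterns.

```python
import re

# Assumed marker line and "Name: value" pseudo-header shape (optionally quoted with ">").
MARKER = re.compile(r"^\s*Begin forwarded message:\s*$", re.IGNORECASE)
HEADER = re.compile(r"^>?\s*([A-Za-z][A-Za-z-]*):\s*(.*)$")


def parse_inline_forward(text):
    """Return the pseudo-headers that follow the marker line, or None if no marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if MARKER.match(line):
            break
    else:
        return None

    headers = {}
    for line in lines[i + 1:]:
        if not line.strip("> \t"):
            if headers:
                break      # an empty (or bare ">") line ends the header block
            continue       # skip empty lines between the marker and the headers
        match = HEADER.match(line)
        if not match:
            break          # first line that does not look like a header
        headers[match.group(1).lower()] = match.group(2).strip()
    return headers


SAMPLE = """\
Hi,

Begin forwarded message:

> From: Me <me@me.me>
> Date: January 30, 2010 18:26:33 PM GMT+02:00
> To: Other Me <other-me@me.me>
> Subject: Unwise question
>
> Body text here.
"""

parsed = parse_inline_forward(SAMPLE)
print(parsed)
```

A body line that happens to look like "Word: text" directly after the headers would be misread as a header, which is the kind of ambiguity that makes these formats hard to parse reliably.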


As the other answers have already pointed out: there is no standard, and your program will not be perfect.

You can look at the headers, in particular the User-Agent header, to see which client was used, and write code specifically for the most common clients.
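A rough sketch of that header check with the standard-library `email` package follows. The substrings in `CLIENT_HINTS` and the returned labels are illustrative assumptions, not an authoritative list of what each client sends; many clients use X-Mailer instead of User-Agent, so both are checked.

```python
from email import message_from_string, policy

# Hypothetical mapping: lowercase substring of the client header -> parser label.
CLIENT_HINTS = [
    ("thunderbird", "thunderbird"),
    ("apple mail", "apple_mail"),
    ("outlook", "outlook"),
    ("lotus notes", "lotus_notes"),
]


def detect_client(msg):
    """Guess the sending client from User-Agent or X-Mailer; 'unknown' if no hint matches."""
    agent = (msg.get("User-Agent") or msg.get("X-Mailer") or "").lower()
    for needle, name in CLIENT_HINTS:
        if needle in agent:
            return name
    return "unknown"


# Made-up sample messages.
msg_a = message_from_string(
    "User-Agent: Mozilla/5.0 (X11; Linux) Thunderbird/3.0\nSubject: x\n\nbody\n",
    policy=policy.default,
)
msg_b = message_from_string(
    "X-Mailer: Apple Mail (2.1077)\nSubject: y\n\nbody\n",
    policy=policy.default,
)
msg_c = message_from_string("Subject: z\n\nbody\n", policy=policy.default)

print(detect_client(msg_a), detect_client(msg_b), detect_client(msg_c))
```

The label could then select a client-specific body parser, with a generic fallback for "unknown".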

To find out which clients you should consider, check out this popularity study. It covers various versions of Outlook, Yahoo!, Hotmail, Mail.app, iPhone, Gmail and Lotus Notes. About 11% of mail is classified as "undetectable", but using the headers of the forwarded emails you can probably do better. Note that the statistics were collected by placing an image inside the email, so the results may be skewed.

Another issue is HTML mail, which may or may not include a plain-text version. I am not sure what the usual client behavior is in this regard.
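Whichever way a given client behaves, the standard-library `email` package can be asked for the plain-text body first and the HTML body as a fallback. This is a small sketch; the helper name `best_text` and both sample messages are made up.

```python
from email import message_from_string, policy


def best_text(msg):
    """Return (subtype, text) for the preferred body part, or (None, '') if none exists."""
    # Prefer text/plain; fall back to text/html when no plain part is present.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return None, ""
    return part.get_content_subtype(), part.get_content()


BOTH = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="B"

--B
Content-Type: text/plain

plain version
--B
Content-Type: text/html

<p>html version</p>
--B--
"""

HTML_ONLY = """\
MIME-Version: 1.0
Content-Type: text/html

<p>only html</p>
"""

both_kind, both_text = best_text(message_from_string(BOTH, policy=policy.default))
html_kind, html_text = best_text(message_from_string(HTML_ONLY, policy=policy.default))
print(both_kind, html_kind)
```

Note that `get_body` is only available on messages parsed with a modern policy such as `policy.default`.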


The convention for replying and forwarding is to prefix each line with ">", once per level of nesting; details such as who sent the original message are left to the client to sort out. So all you need to do in Python is add ">" to the beginning of each line.

 imap Test <imap@gazler.com> Wrote:
 >
 >twice
 >imap Test wrote:
 >> nested
 >>
 >> imap@gazler.com wrote:
 >>> test
 >>>
 >>> --
 >>> Message sent via AHEM.
 >>>
 >>
 >

Attachments simply have to be attached to the message, as you describe.

I am not familiar with Python, but I believe the code would be:

 # Prefix every line, including the first one, with ">"
 string = ">" + string.replace("\n", "\n>")
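Going the other way, a parser can count the leading ">" characters on each line to recover the nesting depth. The following is a minimal sketch; the function name and the regular expression are assumptions, and it does not try to repair lines whose quote prefix was lost to word wrapping.

```python
import re

# Zero or more ">" markers, each optionally followed by one whitespace character,
# then the rest of the line.
QUOTE_PREFIX = re.compile(r"^((?:>\s?)*)(.*)$")


def split_quote_levels(text):
    """Return a list of (depth, line) pairs, where depth is the number of leading '>'."""
    result = []
    for line in text.splitlines():
        prefix, rest = QUOTE_PREFIX.match(line).groups()
        result.append((prefix.count(">"), rest))
    return result


sample = "imap Test wrote:\n>twice\n>> nested\n>>> test"
levels = split_quote_levels(sample)
for depth, line in levels:
    print(depth, line)
```

Only the leading run of ">" is counted, so a ">" that appears later in a line (for example in an address like <imap@gazler.com>) does not affect the depth.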
