How can I parse email text for components like <greeting> <body> <signature> <response text> etc.?

Question

I am writing an application that parses emails, and it would save me a ton of time if I could use a Python library that parses email text into named components like <salutation> <body> <signature> <reply text> and so on.

For example, the text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as

 Salutation: "Hi Dave,\n"
 Body: "Lets meet up this Tuesday\n"
 Signature: "Cheers, Tom\n\n"
 Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: ..."

I know that there is no perfect solution for this kind of problem, but even a library that works reasonably well would help. Where can I find one?

+4

python email email-parsing

Trindaz May 17 '11 at 1:00

3 answers

If you score each line based on the types of words it contains, you can get a good indication of which component it belongs to.

E.g. a line with greeting words near the beginning is a greeting (greetings can also include phrases that refer to the past, for example "it was nice to see you last time").

The body will usually contain words such as "film", "concert", etc. It will also contain verbs (walk, run, etc.), question marks, and phrases that make a suggestion (for example "if you want", "can we", ...). Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation http://ogden.basic-english.org/ http://osteele.com/projects/pywordnet/

The signature will contain closing words.

If you can find a data source with messages that are already labeled with the structure you want, you can do a frequency analysis to find out how often each word occurs in each section.

Each word gets a score vector [greeting score, body score, signature score, ...]. For example, "hi" might occur 900 times in a greeting, 10 times in a body and 3 times in a signature, so "hi" is assigned [900, 10, 3, ...]. A closing word such as "cheers" might be assigned [10, 3, 100, ...].
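The counting step described above could be sketched roughly like this. It is only a minimal illustration: the section labels, the `build_word_scores` helper and the tiny training set are all made up for the example, and a real model would need far more labeled email.

```python
import re
from collections import defaultdict

# Assumed label set; add "reply" etc. as needed.
SECTIONS = ("greeting", "body", "signature")

def tokenize(line):
    """Lowercase a line and split it into simple word tokens."""
    return re.findall(r"[a-z']+", line.lower())

def build_word_scores(labeled_lines):
    """Count how often each word occurs in each section.

    labeled_lines: iterable of (line_text, section_label) pairs.
    Returns {word: [greeting count, body count, signature count]}.
    """
    counts = defaultdict(lambda: [0] * len(SECTIONS))
    for text, label in labeled_lines:
        idx = SECTIONS.index(label)
        for word in tokenize(text):
            counts[word][idx] += 1
    return dict(counts)

# Tiny hand-labeled sample, for illustration only.
training = [
    ("Hi Dave,", "greeting"),
    ("Lets meet up this Tuesday", "body"),
    ("Cheers, Tom", "signature"),
    ("Hi Tom,", "greeting"),
]
word_scores = build_word_scores(training)
print(word_scores["hi"])   # counts per section for "hi"
print(word_scores["tom"])  # "tom" shows up in a greeting and in a signature
```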

You will now have a large list of, say, 500,000 words. Words whose scores do not have a large range are not useful. For example, "catch" might have [100, 101, 80, ...], a range of 21 ("it was good to catch up", "I want to catch a fish", "catch you later"). The word "catch" can occur anywhere.

Now you can reduce the list to about 10,000 words.
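A rough sketch of that pruning step, using the example vectors from this answer. The `prune_by_range` name and the threshold of 50 are assumptions for illustration; in practice you would tune the cutoff (or keep the top N words by range) to land near the list size you want.

```python
def prune_by_range(word_scores, min_range):
    """Keep only words whose highest and lowest section counts
    differ by at least min_range."""
    return {
        word: scores
        for word, scores in word_scores.items()
        if max(scores) - min(scores) >= min_range
    }

example_scores = {
    "hi": [900, 10, 3],       # strongly tied to greetings
    "catch": [100, 101, 80],  # range 21: occurs everywhere
}
useful = prune_by_range(example_scores, min_range=50)
print(sorted(useful))  # "catch" is dropped, "hi" is kept
```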

Now give each line a score of the same form: [greeting score, body score, signature score, ...].

This score is calculated by adding up the score vectors of the words in the line.

E.g. the sentence "hi cheers to give me your number" could be: [900, 10, 3, ...] + [10, 3, 100, ...] + ... = [900 + 10 + ..., 10 + 3 + ..., 3 + 100 + ..., ...] = [1023, 900, 500, ...], say.

Because the largest number is in the greeting position, this sentence is classified as a greeting.

So to score one of your lines and see which component it belongs to, add each word's vector to the line's total.
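Summing the word vectors and picking the largest position might look something like the sketch below. The `score_line` and `classify_line` helpers are hypothetical names, the word table just reuses the example numbers from this answer, and returning None for lines with no known words is a design choice of this sketch rather than part of the method.

```python
import re

# Assumed label set, in the same order as the score vectors.
SECTIONS = ("greeting", "body", "signature")

def score_line(line, word_scores):
    """Add up the score vectors of every known word in the line."""
    totals = [0] * len(SECTIONS)
    for word in re.findall(r"[a-z']+", line.lower()):
        for i, value in enumerate(word_scores.get(word, ())):
            totals[i] += value
    return totals

def classify_line(line, word_scores):
    """Return the section with the largest total, or None if no
    word in the line is in the table."""
    totals = score_line(line, word_scores)
    if not any(totals):
        return None
    return SECTIONS[totals.index(max(totals))]

word_scores = {"hi": [900, 10, 3], "cheers": [10, 3, 100]}
for line in ("Hi Dave,", "Cheers, Tom"):
    print(repr(line), score_line(line, word_scores), classify_line(line, word_scores))
```

Raw counts favor whichever section has the most text in the training data, so it may be worth normalizing each column before summing.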

Good luck. There is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model for the scoring, that will help.

+3

robert king May 17 '11 at 7:59

The first approach that comes to mind (not necessarily the best...) is to start by using split. Here is some code and stuff:

    linearray = emailtext.split('\n')

Now you have an array of lines, each of which is like a paragraph or something.

So linearray[0] will contain the greeting.

Deciding where the reply text begins is a bit more complicated. I noticed that there is a double newline in front of it, so maybe search for that from the back and hope that the last one marks the beginning of the reply text.

Or keep a list of signature words that you might expect, such as "cheers" and the like, and look for them from the front.

Once you figure out where the signature is, the rest is easy.
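Putting those heuristics together, a rough sketch could look like this. The `split_email` helper, the dictionary keys and the `SIGNATURE_WORDS` list are all assumptions made for the example, and it will misfire on emails that have blank lines inside the body or no greeting line at all.

```python
# Assumed list of closing words; extend it for your own mail.
SIGNATURE_WORDS = ("cheers", "regards", "thanks")

def split_email(emailtext):
    """Split an email into greeting, body, signature and reply text
    using the first line, a signature-word scan and the last blank line."""
    # Reply text: everything after the last double newline, if there is one.
    main, sep, reply = emailtext.rpartition("\n\n")
    if not sep:
        main, reply = emailtext, ""
    linearray = main.split("\n")
    greeting, rest = linearray[0], linearray[1:]
    # Scan from the front for the first line starting with a signature word.
    sig_start = len(rest)
    for i, line in enumerate(rest):
        if line.strip().lower().startswith(SIGNATURE_WORDS):
            sig_start = i
            break
    return {
        "greeting": greeting,
        "body": "\n".join(rest[:sig_start]),
        "signature": "\n".join(rest[sig_start:]),
        "reply": reply,
    }

example = ("Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\n"
           "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\n"
           "How about we get together ...")
parts = split_email(example)
for name, text in parts.items():
    print(name, "->", repr(text))
```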

Hope this helps.

+1

Sheena May 17 '11 at 3:58

Trindaz · Accepted Answer · 2011-05-18T04:13:14+0000

https://github.com/Trindaz/EFZP

This provides the functionality asked for in the original question, plus reasonable identification of email zones as they typically appear in email written by native English speakers using common email clients such as Outlook and Gmail.
