Regex for extracting new content from email body

Given a string containing the entire body of an email, I would like to extract only the part that the sender wrote, provided it is a single continuous block of text. For instance:

 Dear Sir:
 That is a good point.

 On Wednesday, June 1, John wrote:
 > Hello world.

Will select:

 Dear Sir:
 That is a good point.

By one continuous block, I mean that the block may contain single newlines, but not consecutive newlines (that is, no blank line). So this will not match:

 Dear Sir:

 That is a good point.

 On Wednesday, June 1, John wrote:
 > Hello world.

By "the part that the sender wrote", I mean that the body of the message may also contain a quoted reply, forwarded text, or a signature, all of which I want to exclude (let me call it "non-original content"). Although there are many variations in the wild, it would be sufficient (for now) to handle only the following cases:

1) a line starting with two hyphens (for example, ----- forwarded message -----); this also covers signatures, which often begin with two hyphens at the start of a line

2) a line starting with "On", followed by a line starting with ">" to catch this format:

 On Wednesday, June 1, John wrote:
 > Hello world.

If there is nothing but whitespace above the non-original content, then there should be no match.

Finally, keep in mind that there can be any amount of whitespace at the beginning of a message, as well as between the target text block and the end of the message, or between the target text block and the beginning of the non-original content. Also, keep in mind that line endings in email may be either bare line feeds (LF) or CRLF pairs.

This is my first attempt, which turned out closer than I expected when I started writing this; it uses the s flag:

 ^\s*(\S[^(?:\n\n|\r\n\r\n)]*\S)\s*(?:$|(?:$|\-\-.*|On [^\n]*\n\>.*)) 

From my testing, it works if the target text is a single line, but not if it spans more than one line. So the main flaw is in this part:

 _______[^(?:\n\n|\r\n\r\n)]*________________________________________ 

UPDATE: this is the solution I ended up using:

 '/\A\s*((?:[^\r\n]+\r?(?:\n|\z))+)\s*(?:\z|(--.*|On .+:\n\>.*))/s' 

Please note that the "On ..." line can wrap over several lines (for example, if the date and email address are long), but in general it will still end with ":\n>".

3 answers

In the part that you noted:

 [^(?:\n\n|\r\n\r\n)]* 

The square brackets denote a character class, and the caret negates it. So I suppose the regex engine builds a character class that matches any single character except "(", "?", ":", "|", ")", CR and LF, rather than the non-capturing group you intended.
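To make that concrete, here is a small sketch using Python's re module (the same module used for testing further down). It checks the bracketed expression one character at a time; the names negated_class, rejected and accepted are only for illustration.

```python
import re

# The bracketed part of the first attempt, taken on its own.
negated_class = re.compile(r"[^(?:\n\n|\r\n\r\n)]")

# Every character listed inside the brackets is rejected by the class.
rejected = [ch for ch in "(?:|)\r\n" if negated_class.fullmatch(ch) is None]

# Ordinary text characters are accepted, one character at a time.
accepted = [ch for ch in "abc ." if negated_class.fullmatch(ch) is not None]

print(rejected)
print(accepted)
```

So the class never looks for the two-newline sequence at all; it only refuses individual characters, including ordinary punctuation such as ":".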

Here is a regex that I believe does what you want for this part:

 ((?:[^\r\n]+\r?\n)*) 

This means: match anything but CR or LF, at least one character, then an optional CR, then a mandatory LF. When that is repeated with * (zero or more times), it cannot match two line endings in a row, because each repetition has to begin with something other than a line ending. The whole thing is then wrapped in parentheses to make a capture group.
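A minimal sketch of that behaviour in Python, assuming a made-up CRLF message as input (the sample text and variable names are illustrative only):

```python
import re

# One or more non-newline characters followed by LF or CRLF, repeated.
block = re.compile(r"((?:[^\r\n]+\r?\n)*)")

text = "Dear Sir:\r\nThat is a good point.\r\n\r\nOn Wednesday, June 1, John wrote:\r\n"

# match() anchors at the start of the string; the repetition stops at the
# blank line because a repetition cannot begin with a line ending.
captured = block.match(text).group(1)
print(repr(captured))
```

The capture ends right after the second line's CRLF, so the blank line and everything after it are left out.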

Now we need to anchor it so that it stops matching where you want. It looks like you expect three anchor cases: the end of the input, the "On ... wrote:" line, or the signature line ("--\n"). Your regular expression is more complex than necessary to cover these three cases; this would do:

 (?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n) 

This is longer than yours because I wanted to make sure it would not anchor on actual message text that happens to have the word "On" at the beginning of a line.

And you allow any number of blank lines between the capture group and the anchor:

 (?:\r?\n)* 

Put them together:

 ((?:[^\r\n]+\r?\n)*)(?:\r?\n)*(?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n) 

I tested this against actual email from my inbox, using the Python re module.
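A test along those lines might look like the sketch below. The two sample messages are invented for illustration (they are not the email from the inbox), and the date in the "On" line is written in the MM/DD/YYYY HH:MM AM format that the strict pattern assumes.

```python
import re

STRICT = re.compile(
    r"((?:[^\r\n]+\r?\n)*)"   # the block of consecutive non-blank lines
    r"(?:\r?\n)*"             # any number of blank lines
    r"(?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n)"
)

reply = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On 06/01/2011 10:15 AM, John wrote:\n"
    "> Hello world.\n"
)

# Same message, but with a blank line splitting the new text into two blocks.
two_blocks = (
    "Dear Sir:\n"
    "\n"
    "That is a good point.\n"
    "\n"
    "On 06/01/2011 10:15 AM, John wrote:\n"
    "> Hello world.\n"
)

m = STRICT.match(reply)
original_text = m.group(1) if m else None
no_match = STRICT.match(two_blocks)

print(repr(original_text))
print(no_match)
```

Using match() rather than search() keeps the pattern pinned to the start of the message, which is what makes the two-block case fail instead of matching somewhere in the middle.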

NOTE: Actually, now that I think about it, I do not recommend using such a strict regular expression to match the "On" line. That line is inserted by the sender's email client, and you have no control over it. What if a user's email client inserts a 24-hour time instead of AM/PM? (I have even seen French email clients insert a French word instead of "On", so the whole line would not match at all!) So you may want a looser pattern for the "On" line, but be careful: if it is too loose and the message body contains a line that starts with "On", the match could end too early.

Here is a simple pattern that should work:

 On \d[^\n]+\n> 

That is "On ", then a digit, then everything up to the end of the line, and the next line must start with ">". This should work, except in the pathological case where the email body has a line starting with "On" and a number, and the next line starts with the word "From", so that the email client inserts ">" before "From".
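A quick sketch of how this pattern behaves on a few invented lines. Note that because it requires a digit right after "On ", an attribution line that begins with a weekday name, like the one in the question's example, is not recognised.

```python
import re

on_line = re.compile(r"On \d[^\n]+\n>")

samples = {
    # Attribution line that starts with a numeric date (hypothetical format).
    "numeric": "On 06/01/2011 14:05, John wrote:\n> Hello world.",
    # Attribution line from the question, starting with a weekday name.
    "weekday": "On Wednesday, June 1, John wrote:\n> Hello world.",
    # Ordinary body text that merely starts with the word "On".
    "prose": "On the other hand, I agree.\nSee you soon.",
}

found = {name: on_line.search(text) is not None for name, text in samples.items()}
print(found)
```

If weekday-style attribution lines need to be caught as well, the pattern in the EDIT below (any text after "On ", then a line starting with ">") is the looser alternative.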

In any case, all together:

 ((?:[^\r\n]+\r?\n)*)(?:\r?\n)*(?:$|--\r?\n|On \d[^\n]+\n>) 

EDIT: you asked me to make a quick edit and update this with your final pattern, so here it is:

 /\A\s*((?:[^\r\n]+\r?(?:\n|\z))+)\s*(?:\z|(--.*|On [^\n]+\n\>.*))/s 
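For anyone trying this pattern from Python rather than with the /.../s delimiter syntax, here is a rough translation. Two things are adaptations on my part: the s flag becomes re.DOTALL, and the end-of-string anchor is written \Z, which is how Python's re module spells it. The sample messages are made up, and the sketch only checks the two examples from the question, not every case listed there.

```python
import re

FINAL = re.compile(
    r"\A\s*((?:[^\r\n]+\r?(?:\n|\Z))+)\s*(?:\Z|(--.*|On [^\n]+\n>.*))",
    re.DOTALL,  # stands in for the /s flag so that .* can span lines
)

reply = (
    "  \n"
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)

# Same text, but the new content is split in two by a blank line.
two_blocks = (
    "Dear Sir:\n"
    "\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)

m = FINAL.match(reply)
new_text = m.group(1).rstrip() if m else None
quoted = m.group(2) if m else None
split_result = FINAL.match(two_blocks)

print(repr(new_text))
print(repr(quoted))
print(split_result)
```

Group 1 holds the sender's block and group 2 whatever non-original content followed it, so the caller can keep either part.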

/^(?!>|On|--)(.*)+/m should match any line that does not start with "On", ">" or "--".
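A sketch of that per-line idea in Python, with two simplifications that are mine rather than the answer's: the trailing + after the group is dropped, and empty results are filtered out afterwards. The sample message is invented.

```python
import re

# Negative lookahead at the start of each line; re.MULTILINE plays the
# role of the /m flag so that ^ matches at every line start.
line_filter = re.compile(r"^(?!>|On|--)(.*)", re.MULTILINE)

text = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world."
)

# findall returns the captured group for every matching line;
# blank lines produce empty strings, which are discarded here.
kept_lines = [line for line in line_filter.findall(text) if line]
print(kept_lines)
```

This filters line by line, so it does not by itself enforce the "single block" rule from the question: a non-quoted line appearing after the quoted text would be kept too.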


This pattern, used with JavaScript's .match(), should handle all your test cases:

 /((.|[\r\n])+?)([\r\n][\r\n]|On.+[\r\n]\>|--)/ 

This means: start the regular expression ( / ), then match any character or a line break ( .|[\r\n] ) one or more times ( + ), non-greedily ( ? ), followed by two consecutive line-break characters ( [\r\n][\r\n] ), or "On", the rest of that line, a line break and ">", or "--" ( [\r\n][\r\n]|On.+[\r\n]\>|-- ), and finally end the regular expression ( / ).

The first capture group is the text you are after.

See the demo here: http://jsfiddle.net/57L5t/
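The same pattern can also be tried outside the browser. Below is a sketch with Python's re module instead of JavaScript (a swap made only so the check is easy to run), on an invented message in both LF and CRLF form.

```python
import re

PATTERN = re.compile(r"((.|[\r\n])+?)([\r\n][\r\n]|On.+[\r\n]>|--)")

lf_text = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)
# The same message with CRLF line endings.
crlf_text = lf_text.replace("\n", "\r\n")

lf_block = PATTERN.search(lf_text).group(1)
crlf_block = PATTERN.search(crlf_text).group(1)

print(repr(lf_block))
print(repr(crlf_block))
```

With LF line endings the first group is the two-line block. With CRLF, [\r\n][\r\n] already matches the first single CRLF, so the group stops after "Dear Sir:"; normalising line endings before matching would be one way around that.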
