Comparing text against a known pattern with anomaly detection (reverse templating)

I am looking for an algorithm, or even just a family of algorithms, for verifying that a short text (an email) matches a known pattern. The implementation will probably be in Python or Perl, but that is flexible.

Here's the problem:

Servers that have access to production data need to be able to send email that will go out to the Internet, for example:

```
Dear John Smith,

We received your last payment for $123.45 on 2/4/13.
We'd like you to be aware of the following charges:
  $12.34 Spuznitz, LLC on 4/1
  $43.21 1-800-FLOWERS on 4/2

As always, you can view these transactions in our portal.
Thank you for your business!
```

Obviously, parts of the email will vary: the greeting ("John Smith"), the "$123.45 on 2/4/13", and the printed transaction lines. Other parts ("We received your last payment") are very static. I want to match the static parts of the text exactly and verify that the dynamic parts stay within reasonable bounds (I might know, for example, that at most 5 transaction lines should be printed).

Since I am worried about data exfiltration, I want to make sure that email which does not match this pattern never goes out: I want to examine every outgoing email and quarantine anything that does not look like what I expect. So I need to encode this template in machine-checkable form and block any message that strays too far from it.

So the question is: where should I look for a matching mechanism? Bayesian filtering tries to verify that a specific message is sufficiently similar to a diffuse corpus, which is the opposite of my problem. Tools like Perl's Template module are close, but they are built for producing output, not for parsing or comparison. Simple diff-style comparisons do not cope well with bounded dynamic content.

How can I check if these outgoing emails are quacking like a duck?

+8
python validation text perl templates
3 answers

You can use grammars for strict matching. To keep the abstraction manageable, you can organize regular expressions into grammars: http://www.effectiveperlprogramming.com/blog/1479

Or you can use Marpa, a dedicated grammar engine.

If you want a more statistical approach, consider n-grams. First, tokenize the text and replace the variable chunks with meaningful placeholders such as CURRENCY and DATE. Then build n-grams. Now you can use the Jaccard index to compare two texts.

Here is a Pure-Perl implementation that works on trigrams:

```perl
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;

my $ngram1 = ngram(3, tokenize(<<'TEXT1'));
Dear John Smith,
We received your last payment for $123.45 on 2/4/13.
We'd like you to be aware of the following charges:
$12.34 Spuznitz, LLC on 4/1
$43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal.
Thank you for your business!
TEXT1

my $ngram2 = ngram(3, tokenize(<<'TEXT2'));
Dear Sally Bates,
We received your last payment for $456.78 on 6/9/12.
We'd like you to be aware of the following charges:
$123,43 Gnomovision on 10/1
As always, you can view these transactions in our portal.
Thank you for your business!
TEXT2

my %intersection =
    map { exists $ngram1->[2]{$_} ? ($_ => 1) : () }
    keys %{$ngram2->[2]};
my %union =
    map { $_ => 1 }
    keys %{$ngram1->[2]}, keys %{$ngram2->[2]};

printf "Jaccard similarity coefficient: %0.3f\n",
    keys(%intersection) / keys(%union);

sub tokenize {
    my @words = split m{\s+}x, lc shift;
    for (@words) {
        s{\d{1,2}/\d{1,2}(?:/\d{2,4})?}{ DATE }gx;
        s{\d+(?:\,\d{3})*\.\d{1,2}}{ FLOAT }gx;
        s{\d+}{ INTEGER }gx;
        s{\$\s(?:FLOAT|INTEGER)\s}{ CURRENCY }gx;
        s{^\W+|\W+$}{}gx;
    }
    return @words;
}

sub ngram {
    my ($size, @words) = @_;
    --$size;

    my $ngram = [];

    for (my $j = 0; $j <= $#words; $j++) {
        my $k = $j + $size <= $#words ? $j + $size : $#words;
        for (my $l = $j; $l <= $k; $l++) {
            my @buf;
            for my $w (@words[$j .. $l]) {
                push @buf, $w;
            }
            ++$ngram->[$#buf]{join(' ', @buf)};
        }
    }

    return $ngram;
}
```

You can use one text as the template and match the incoming letters against it. Check out String::Trigram for an efficient implementation. The Google Ngram Viewer is a good illustration of n-gram matching.
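For readers who prefer Python, roughly the same idea (placeholder substitution, word trigrams, Jaccard index) can be sketched as follows. The sample texts, placeholder regexes, and the 0.5 quarantine threshold below are illustrative assumptions, not part of the answer:

```python
import re

def tokenize(text):
    """Lowercase, split on whitespace, and replace variable chunks
    (dates, dollar amounts, bare numbers) with placeholders."""
    words = []
    for w in text.lower().split():
        w = re.sub(r"\d{1,2}/\d{1,2}(?:/\d{2,4})?", "DATE", w)
        w = re.sub(r"\$\d+(?:,\d{3})*(?:\.\d{1,2})?", "CURRENCY", w)
        w = re.sub(r"\d+", "INTEGER", w)
        words.append(w.strip(".,!"))
    return words

def trigrams(words):
    """Set of consecutive word triples."""
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(a, b):
    """Jaccard index of two sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

template = "Dear John Smith, We received your last payment for $123.45 on 2/4/13."
incoming = "Dear Sally Bates, We received your last payment for $456.78 on 6/9/12."

score = jaccard(trigrams(tokenize(template)), trigrams(tokenize(incoming)))
quarantine = score < 0.5  # threshold is a guess; tune it on real traffic
```

On these one-line samples the score lands just above 0.5; with full-length emails the static body contributes many more shared trigrams, so conforming messages score higher and outliers stand out more clearly.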

+4

If you want to match against an existing template, e.g. one with flow-control elements such as {% for x in y %}, by comparing its expected output, you would have to parse the template language, which sounds like a lot of work.

On the other hand, if you are willing to write a second template purely for verification purposes, something like:

```
Dear {{customer}}, We received your last payment for {{currency}} on {{full-date}}\.
We'd like you to be aware of the following charges:
( {{currency}} {{supplier}} on {{short-date}} ){,5}As always, you can view these transactions in our portal\.
```

..., which is just regular-expression syntax with a small extension, then it is quite simple to hack together something that validates it:

```python
import re

FIELDS = {
    "customer": r"[\w\s\.-]{,50}",
    "supplier": r"[\w\s\.,-]{,30}",
    "currency": r"[$€£]\d+\.\d{2}",
    "short-date": r"\d{,2}/\d{,2}",
    "full-date": r"\d{,2}/\d{,2}/\d{2}",
}

def validate(example, template_file):
    with open(template_file) as f:
        template = f.read()
    for tag, pattern in FIELDS.items():
        template = template.replace("{{%s}}" % tag, pattern)
    valid = re.compile(template + "$")
    return re.match(valid, example) is not None
```

The above is not the greatest Python code of all time by any means, but it should be enough to convey the general idea.
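As a quick usage sketch (my own, not from the answer): inlining the template instead of reading it from a file, and with a shortened field set, the expansion step works like this. The length bounds are written as {0,50} rather than {,50} purely to be explicit:

```python
import re

# Field patterns, following the FIELDS table above (bounds written explicitly).
FIELDS = {
    "customer": r"[\w\s\.-]{0,50}",
    "currency": r"[$€£]\d+\.\d{2}",
    "full-date": r"\d{0,2}/\d{0,2}/\d{2}",
}

template = r"Dear {{customer}}, we received your payment of {{currency}} on {{full-date}}\."

# Expand each {{tag}} into its regular expression, just as validate() does.
for tag, pattern in FIELDS.items():
    template = template.replace("{{%s}}" % tag, pattern)

valid = re.compile(template + "$")

ok = valid.match("Dear John Smith, we received your payment of $123.45 on 2/4/13.")
bad = valid.match("Attached: full dump of the customer table.")
```

The conforming message matches and the non-conforming one does not; a gateway would quarantine anything in the second category.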

+3

I would go with the longest common subsequence. A standard implementation can be found here:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence
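As a rough Python sketch of how an LCS score could drive the quarantine decision (word-level tokens, the sample texts, and the 0.8 threshold are my assumptions here):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists,
    via the standard O(len(a) * len(b)) dynamic program."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def similarity(template, email):
    """Fraction of the template's words that appear, in order, in the email."""
    t = template.split()
    return lcs_len(t, email.split()) / len(t)

template = "We received your last payment . Thank you for your business !"
email = "Dear Sally , We received your last payment for $9.99 . Thank you for your business !"

quarantine = similarity(template, email) < 0.8  # threshold is a guess
```

If you would rather not hand-roll the DP, Python's standard-library difflib.SequenceMatcher.ratio() gives a related (though not identical) similarity measure.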

If you need a faster algorithm and/or many more ideas on inexact string matching, the standard reference is this book:

http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

Don't be put off by the title: computational biology is largely concerned with matching long strings (DNA sequences) against large databases.

+2
