Efficiently matching a text message against thousands of regular expressions

I am trying to solve a problem where I have to match a text message against thousands of regular expressions of the form

<some string> {0 to 300 chars} <some string> {0 to 300 chars}

e.g.

"on"[ \t\r]*(.){0,300}"."[ \t\r]*(.){0,300}"from"

or a real example could be

"Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*(.){0,80}[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"(.){0,80}[ \t\r]*"Courier,"[ \t\r]*(.){0,80}[ \t\r]*"on"(.){0,80}"."[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days."

First, I used the Java regex engine and matched the input string against one regex at a time. That was far too slow. I found that the Java regex engine uses a backtracking, NFA-style matcher (a non-deterministic finite automaton), which can become extremely slow due to catastrophic backtracking. So I thought about converting the regular expressions to DFAs (deterministic finite automata) instead, using the flex lexer generator to compile hundreds of regular expressions into a single DFA; that would give a match result in O(n), where n is the length of the input string. But because of the fixed repetition counts in the regexes, flex takes forever to compile them.
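
For reference, this is roughly what the one-pattern-at-a-time baseline looks like. The rule list and message below are placeholders, and the flex-style quoting is rewritten into java.util.regex syntax:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveMatcher {
    public static void main(String[] args) {
        // Placeholder rules; the real system loads thousands of them.
        List<String> regexes = List.of(
                "on[ \\t\\r]*.{0,300}\\.[ \\t\\r]*.{0,300}from",
                "Dear[ \\t\\r]*Customer,.{0,80}out[ \\t\\r]*for[ \\t\\r]*delivery.*");

        // Compile once, up front, so only the matching cost is paid per message.
        List<Pattern> compiled = new ArrayList<>();
        for (String r : regexes) {
            compiled.add(Pattern.compile(r, Pattern.DOTALL));
        }

        String message = "Dear Customer, Your package #123 is out for delivery via ...";

        // One pattern at a time: with the real rule set this means thousands of
        // matcher runs per message, each of which may backtrack heavily.
        for (int i = 0; i < compiled.size(); i++) {
            Matcher m = compiled.get(i).matcher(message);
            if (m.find()) {
                System.out.println("matched rule " + i);
            }
        }
    }
}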

One idea is to drop the fixed repetition counts and use unbounded repetition instead, e.g.

"on"[ \t\r]*(.)*"."[ \t\r]*(.)*"from"

This is less precise, because the gaps between the literal strings ("on", "." and "from") are no longer limited in length, but flex compiles such patterns quickly, so flex could still combine all of them into a single DFA.
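
To make that trade-off concrete, here is a small check written against java.util.regex (the sample message is made up, and the flex-style quoting is again rewritten): a message whose second gap is longer than 300 characters is rejected by the bounded pattern but accepted by the relaxed one.

import java.util.regex.Pattern;

public class RelaxedVsBounded {
    public static void main(String[] args) {
        Pattern bounded = Pattern.compile(
                "on[ \\t\\r]*.{0,300}\\.[ \\t\\r]*.{0,300}from", Pattern.DOTALL);
        Pattern relaxed = Pattern.compile(
                "on[ \\t\\r]*.*\\.[ \\t\\r]*.*from", Pattern.DOTALL);

        // The gap between "." and "from" is 500 characters, i.e. over the 300 limit.
        String msg = "on 12 Aug." + "x".repeat(500) + "from";

        System.out.println(bounded.matcher(msg).find()); // false: bound exceeded
        System.out.println(relaxed.matcher(msg).find()); // true: the bound is gone
    }
}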

Is there a better approach to this problem?

The problem is the (.){0,80} parts:

"Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*
(.){0,80}
[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"
(.){0,80}
[ \t\r]*"Courier,"[ \t\r]*
(.){0,80}
[ \t\r]*"on"
(.){0,80}"."
[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days."

To honour such a bound, the DFA has to count: it needs one state for having consumed 80 of the allowed wildcard characters, another for 79, another for 78, 77, and so on down to 0 (compare .{80}?, which already needs about 80 states on its own). And because the literal strings can themselves occur inside the wildcard sections, the counters of the different groups have to be tracked in combination, so the number of states multiplies. That is why flex takes so long to build the DFA.
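
One way to see this effect concretely is to build the DFAs and compare state counts. A small sketch, assuming the dk.brics.automaton library (not mentioned in the original text; it is just a convenient way to get at DFA state counts):

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public class StateCount {
    static int states(String regex) {
        Automaton a = new RegExp(regex).toAutomaton();
        a.determinize();
        a.minimize();
        return a.getNumberOfStates();
    }

    public static void main(String[] args) {
        // Unbounded gaps: the DFA never has to count, so it stays small.
        System.out.println(states("on.*\\..*from"));
        // Bounded gaps: roughly one state per remaining-count value, and the
        // counters of the two gaps have to be tracked in combination.
        System.out.println(states("on.{0,80}\\..{0,80}from"));
    }
}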

With .* instead of .{0,80}, no counting is needed, so the combined DFA stays small; that is why the relaxed patterns compile quickly.

Switching to the lazy form .{0,80}? does not help either: laziness only changes how a backtracking engine looks for a match, not which strings are matched, so the resulting DFA is exactly the same.
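
A quick way to check that last point with java.util.regex (the sample message is made up): both forms accept and reject the same messages.

import java.util.regex.Pattern;

public class LazyVsGreedy {
    public static void main(String[] args) {
        String msg = "on 12 Aug. shipped from the depot";

        Pattern greedy = Pattern.compile("on.{0,80}\\..{0,80}from", Pattern.DOTALL);
        Pattern lazy = Pattern.compile("on.{0,80}?\\..{0,80}?from", Pattern.DOTALL);

        // Greedy and lazy bounded repetition match the same set of inputs;
        // only the order in which a backtracking engine tries alternatives differs.
        System.out.println(greedy.matcher(msg).find()); // true
        System.out.println(lazy.matcher(msg).find());   // true
    }
}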
