Very slow regex search

I'm not sure I fully understand what happens with the following regular expression search:

>>> import re >>> template = re.compile("(\w+)+\.") >>> target = "a" * 30 >>> template.search(target) 

search() call takes several minutes, CPU utilization - 100%. The behavior is reproduced for both versions 2.7.5 and 3.3.3 in Python.

Interestingly, if the length of the string is less than 20-25 characters, search() returns as in an instant.

What's happening?

+10
performance python string regex
Mar 26 '14 at 2:02
source share
3 answers

Understanding this issue requires an understanding of how NFA works in RegExp.

Developing an NFA definition may be too difficult a mission for me. A search on the NFA on the wiki will give you a better explanation. Here, just think that NFAs are the search robots that you give.

The crudely implemented NFA is a bit dumb, it just looks forward to one or two tokens that you give. So, in the synthetic example you give, the NFA first looks like \w+ (and not the brackets for grouping).

Because + is a greedy quantifier, that is, it matches as many characters as possible, so the NFA frantically continues to use characters in target . After 30 a s, the NFA meets the end of the line. After that, the NFA understands that it needs to map other tokens in the template . The next token is + . NFA has mapped it, so it goes to \. . This time he fails.

What the NFA does next, you need to throw one character back, trying to match the pattern, truncating the fake \w+ . Thus, the NFA divided target into two groups, 29 a into one \w+ and one end of a . At first, the NFA tries to use the final element a, matching it with the second + , but it still fails when the NFA \. encounters \. . The NFA continues the process above until it gets full compliance, otherwise it will try to use all possible sections.

So, (\w+)+\. instructs the NFA to group target in this way: the target is split into one or more groups, at least one character per group, and the target ends with a period of ".". While the period is not agreed. NFA is trying to use all partitions. So how many sections are there? 2 ^ n, exponent 2. (JUst think inserting a separator between a ). As below

 aaaaaaa a aaaaaa aa aaaaaa aa ..... ....... aa aa ... a aaaaa .... a 

If the NFA matches \. it will not hurt. But when it does not fit, this expression is doomed to be infinite exponential.

I do not advertise, but Mastering Regular Expression is a good book for understanding the mechanism in RegExp.

+12
Mar 26 '14 at 2:52
source share

Slowness caused by engine rollback:

 (\w+)+\. 

A reverse search will naturally occur with this pattern if there is no at the end of the line . . First, the engine will try to match as many \w as possible and backtracking when it turns out that a larger number of characters needs to be matched before the character.

 (ax 59) . (ax 58) . ... (a) . 

Finally, he will not match. However, the second + in your template causes the engine to check for possible paths (n-1)! , so:

 (ax 58) (a) . (ax 57) (a) (a) . (ax 57) (ax 2) . ... (a) (a) (a) (a) (a) (a) (a) ... 

Removing + prevent an abnormal amount of backtracking:

 (\w+)\. 

Some implementations will also support possessive quantifiers, which might be more ideal in this particular scenario:

 (\w++)\. 
+5
Mar 26 '14 at 2:48
source share

The second plus causes problems:

 template = re.compile("(\w+)\.") 

works great for me. To see the parse tree for a regular expression, go to re.DEBUG as the second argument to compile:

 import re re.compile("(\w+)+\.", re.DEBUG) print "\n\n" re.compile("(\w+)\.", re.DEBUG) max_repeat 1 65535 subpattern 1 max_repeat 1 65535 in category category_word literal 46 subpattern 1 max_repeat 1 65535 in category category_word literal 46 

Process terminated by exit code 0

This proves that the second plus adds a loop that python parser-regex should close at 65535. This proves my theory somewhat.

Note that to run you will need a new python interpreter for each execution. re.compile remembers the values ​​passed to, so it will not recompile the same regular expression twice, repeatedly running that in ipython, for example, it does not print the parse tree after the first run.

+1
Mar 26 '14 at 2:12
source share



All Articles