Very slow regex search

Question

Very slow regex search

I'm not sure I fully understand what happens with the following regular expression search:

>>> import re >>> template = re.compile("(\w+)+\.") >>> target = "a" * 30 >>> template.search(target)

search() call takes several minutes, CPU utilization - 100%. The behavior is reproduced for both versions 2.7.5 and 3.3.3 in Python.

Interestingly, if the length of the string is less than 20-25 characters, search() returns as in an instant.

What's happening?

+10

performance python string regex

alecxe Mar 26 '14 at 2:02

source share

3 answers

Slowness caused by engine rollback:

 (\w+)+\.

A reverse search will naturally occur with this pattern if there is no at the end of the line . . First, the engine will try to match as many \w as possible and backtracking when it turns out that a larger number of characters needs to be matched before the character.

 (ax 59) . (ax 58) . ... (a) .

Finally, he will not match. However, the second + in your template causes the engine to check for possible paths (n-1)! , so:

 (ax 58) (a) . (ax 57) (a) (a) . (ax 57) (ax 2) . ... (a) (a) (a) (a) (a) (a) (a) ...

Removing + prevent an abnormal amount of backtracking:

 (\w+)\.

Some implementations will also support possessive quantifiers, which might be more ideal in this particular scenario:

 (\w++)\.

+5

Ja͢ck Mar 26 '14 at 2:48

source share

The second plus causes problems:

 template = re.compile("(\w+)\.")

works great for me. To see the parse tree for a regular expression, go to re.DEBUG as the second argument to compile:

 import re re.compile("(\w+)+\.", re.DEBUG) print "\n\n" re.compile("(\w+)\.", re.DEBUG) max_repeat 1 65535 subpattern 1 max_repeat 1 65535 in category category_word literal 46 subpattern 1 max_repeat 1 65535 in category category_word literal 46

Process terminated by exit code 0

This proves that the second plus adds a loop that python parser-regex should close at 65535. This proves my theory somewhat.

Note that to run you will need a new python interpreter for each execution. re.compile remembers the values passed to, so it will not recompile the same regular expression twice, repeatedly running that in ipython, for example, it does not print the parse tree after the first run.

+1

Simon Mar 26 '14 at 2:12

source share

Herrington Darkholme · Accepted Answer · 2014-03-26 02:52

Understanding this issue requires an understanding of how NFA works in RegExp.

Developing an NFA definition may be too difficult a mission for me. A search on the NFA on the wiki will give you a better explanation. Here, just think that NFAs are the search robots that you give.

The crudely implemented NFA is a bit dumb, it just looks forward to one or two tokens that you give. So, in the synthetic example you give, the NFA first looks like \w+ (and not the brackets for grouping).

Because + is a greedy quantifier, that is, it matches as many characters as possible, so the NFA frantically continues to use characters in target . After 30 a s, the NFA meets the end of the line. After that, the NFA understands that it needs to map other tokens in the template . The next token is + . NFA has mapped it, so it goes to \. . This time he fails.

What the NFA does next, you need to throw one character back, trying to match the pattern, truncating the fake \w+ . Thus, the NFA divided target into two groups, 29 a into one \w+ and one end of a . At first, the NFA tries to use the final element a, matching it with the second + , but it still fails when the NFA \. encounters \. . The NFA continues the process above until it gets full compliance, otherwise it will try to use all possible sections.

So, (\w+)+\. instructs the NFA to group target in this way: the target is split into one or more groups, at least one character per group, and the target ends with a period of ".". While the period is not agreed. NFA is trying to use all partitions. So how many sections are there? 2 ^ n, exponent 2. (JUst think inserting a separator between a ). As below

 aaaaaaa a aaaaaa aa aaaaaa aa ..... ....... aa aa ... a aaaaa .... a

If the NFA matches \. it will not hurt. But when it does not fit, this expression is doomed to be infinite exponential.

I do not advertise, but Mastering Regular Expression is a good book for understanding the mechanism in RegExp.

Very slow regex search

More articles: