A regular expression that never ends

I wrote a small, naive regular expression that was supposed to find text inside parentheses:

re.search(r'\((.|\s)*\)', name)

I know that this is not the best way to do this for several reasons, but it works fine. I'm just looking for an explanation of why for some lines this expression starts exponentially longer and then never ends. Last night, after running this code for several months, one of our servers suddenly got stuck in line with a line similar to the following:

 x (y) z 

I experimented with it and decided that the time taken to double for each space between "y" and "z":

 In [62]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (22 * ' ') + 'z') 1 loops, best of 3: 1.23 s per loop In [63]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (23 * ' ') + 'z') 1 loops, best of 3: 2.46 s per loop In [64]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * ' ') + 'z') 1 loops, best of 3: 4.91 s per loop 

But also that characters other than space do not have the same effect:

 In [65]: %timeit re.search(r'\((.|\s)*\)', 'x (y)' + (24 * 'a') + 'z') 100000 loops, best of 3: 5.23 us per loop 

Note: I am not looking for a better regex or other solution to this problem. We no longer use it.

+8
python regex
source share
2 answers

Catastrophic rollback

As CaffGeek correctly says, the problem is related to one form of catastrophic backtracking. These two alternatives correspond to a space (or tab), and this applies without time limit. In addition, the point coincides with the closing parentheses, so after the initial greenhouse matches, this expression always matches the very end of the line before it must painstakingly retreat to find the closing brackets. And during this backtracking process, another alternative is checked in every place (which is also successful for spaces or tabs). Thus, all possible matching combinations must be checked before the engine can retreat to one position. With a lot of spaces after the closing steam, it folds quickly. The specific problem for the case when there is a corresponding close pair can be solved by simply calculating the star marker (i.e. r'\((.|\s)*?\)' ), But the problem with the regular expression of flight is still exists for a case of inconsistency when there is an open wig with does not match a close pair in the subject line.

The original regex is really, really bad! (and also does not correspond to the coincidence of closing partners in the presence of more than one pair).

The correct expression to match the shortest brackets (which is very fast for matching and non-matching cases), of course:

 re_innermostparens = re.compile(r""" \( # Literal open paren. [^()]* # Zero or more non-parens. \) # Literal close paren. """, re.VERBOSE) 

All regex authors should read MRE3!

This is all explained in great detail (with detailed examples and recommended recommendations) in Jeffrey Friedl, required by regex-authors: Mastering Regular Expressions (3rd edition) . I can honestly say that this is the most useful book I have ever read. Regular expressions are a very powerful tool, but, like loaded weapons, you need to use them with great care and accuracy (or you shoot in the foot!)

+6
source share

Probably a problem with ReDos.

See: http://en.wikipedia.org/wiki/ReDoS

You can create a regular expression for denial of service on poorly constructed regular expressions.

For example, from here: https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS

This is the regex ^(a+)+$

enter image description here

There are 16 possible ways to enter aaaaX in the above chart. But for aaaaaaaaaaaaaaaaX there are 65,536 possible paths, and the number for each additional a doubles. This is an extreme case where a naive algorithm is problematic because it has to go through many paths and then fails.

I suspect that most of the problem with your regular expression is (.|\s) , which is somewhat confusing since \s already included in . . But creates an option point.

EDIT: I think this was one of the best explanations for the problem I read.

http://msdn.microsoft.com/en-us/magazine/ff646973.aspx

+5
source share

All Articles