Why does my Python regex verify that more than one group takes so long?

This question comes up in a Django address, but the problem seems to be common.

I want to map URLs constructed as follows:

1,2,3,4,5,6/10,11,12/ 

I use regex:

 ^(?P<apples>([0123456789]+,?)+)/(?P<oranges>([0123456789]+,?)+)/$ 

When I try to match it with a "valid" URL (i.e. matching), I get an instant match:

 In [11]: print datetime.datetime.now(); re.compile(r"^(?P<apples>([0123456789]+,?)+)/(?P<oranges>([0123456789]+,?)+)/$").search("114,414,415,416,417,418,419,420,113,410,411,412,413/1/"); print datetime.datetime.now() 2011-10-18 14:27:42.087883 Out[11]: <_sre.SRE_Match object at 0x2ab0960> 2011-10-18 14:27:42.088145 

However, when I try to match an "invalid" URL (inconsistency), the entire regular expression takes a time to return nothing:

 In [12]: print datetime.datetime.now(); re.compile(r"^(?P<apples>([0123456789]+,?)+)/(?P<oranges>([0123456789]+,?)+)/").search("114,414,415,416,417,418,419,420,113,410,411,412,413/"); print datetime.datetime.now() 2011-10-18 14:29:21.011342 2011-10-18 14:30:00.573270 

I guess there is something in the regexp engine that slows down when multiple groups need to be matched. Is there a workaround for this? Maybe my regex should be fixed?

+4
source share
2 answers

This is a known flaw in many regex engines, including Python and Perl. What happens is that the engine backs off and gets an exponential explosion of possible coincidences. Improved regex mechanisms do not use backtracking for such a simple regex.

You can fix this by getting rid of the extra comma. This is what allows the engine to look at a string of type 123 and decide whether to analyze it as (123) or (12)(3) or (1)(23) or (1)(2)(3) . It is a lot of coincidences to try only three digits, so you can see how quickly it explodes into a couple of dozen digits.

 ^(?P<apples>[0-9]+(,[0-9]+)*)/(?P<oranges>[0-9]+(,[0-9]+)*)/$ 

This will make the regex engine always group 123,456 as (123),(456) and never like (12)(3),(4)(56) or anything else. Since it will correspond to only one method, the back tracking mechanism will not collide with a combinatorial explosion of possible analyzes. Again, the best regular expression mechanisms do not suffer from this drawback.

Update: If I wrote this, I would do it as follows:

 ^(?P<apples>[0-9,]+)/(?P<oranges>[0-9,]+)$ 

This will match multiple dummy URLs (e.g., ,/, ), but you can always return 404 after you parse and break it.

 try: apples = [int(x) for x in apples.split(',')] except ValueError: # return 404 error 
+5
source

You can use this regex:

 ^(?P<apples>(?:\d+,)*\d+)/(?P<oranges>(?:\d+,)*\d+)/$ 

\ d matches a digit

+2
source

All Articles