Return to regexp faster than expected

Question

Return to regexp faster than expected

According to perlre , the following code should take a few seconds:

$ time perl -E '$_="((()" . ("a") x 18; say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;' real 0m0.006s user 0m0.000s sys 0m0.005s

The documentation says:

See how the above pattern detects a lack of matching ((()aaaaaaaaaaaaaaaaaa after a few seconds, but each additional letter this time doubles.

As you can see, on my laptop it takes only a fraction of a second. Even work with a million a completed in half a second:

 $ time perl -E '$_="((()" . ("a") x 1000000; say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;' real 0m0.454s user 0m0.446s sys 0m0.008s

What am I missing here?

+8

regex perl backtracking

Håkon Hægland Aug 20 '15 at 7:41

source share

2 answers

See how the pattern above detects a mismatch on ((()aaaaaaaaaaaaaaaaaa in a few seconds.

This offer dates from at least April 2001, when perl 5.6.1 was released. You can see the perlre man page for this release at search.cpan.org/~gsar/perl-5.6.1/pod/perlre.pod .

This is perhaps the lesson you need to learn here when documenting software. Be careful what you write, as your written words will remain unchanged, despite many years of improvement and multiple forks of others.

+2

David hammen Aug 20 '15 at 16:32

source share

Sobrique · Accepted Answer · 2015-08-20T08:19:26+0000

One of the tricks you can do to figure out what the regular expression engine does is:

 use re 'debug';

eg:.

 use strict; use warnings; use re 'debug'; my $str = "a" x 18; $str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

Then what the regular expression engine does is printed:

 Compiling REx " \(([^()]+|\( [^()]* \))+\)" Final program: 1: EXACT <(> (3) 3: CURLYX[0] {1,32767} (40) 5: OPEN1 (7) 7: BRANCH (20) 8: PLUS (37) 9: ANYOF[^()][{above_bitmap_all}] (0) 20: BRANCH (FAIL) 21: EXACT <(> (23) 23: STAR (35) 24: ANYOF[^()][{above_bitmap_all}] (0) 35: EXACT <)> (37) 37: CLOSE1 (39) 39: WHILEM[1/3] (0) 40: NOTHING (41) 41: EXACT <)> (43) 43: END (0) anchored "(" at 0 floating ")" at 2..2147483647 (checking floating) minlen 3 Matching REx " \(([^()]+|\( [^()]* \))+\)" against "aaaaaaaaaaaaaaaaaa" Intuit: trying to determine minimum start position... doing 'check' fbm scan, [2..18] gave -1 Did not find floating substr ")"... Match rejected by optimizer Freeing REx: " \(([^()]+|\( [^()]* \))+\)"

If you add your brackets back, you will get a different result - I get about 2000 steps to process the regular expression. This increases with each additional letter - about 300 steps.

So, I would say yes - there is a catastrophic backtracking, but you may find that processors (and optimizing the regex engine) mean that the time is much shorter.

 use strict; use warnings; use re 'debug'; my $str = "((()"."a" x 100_000; $str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

It works much longer - but, at least, this is due to the fact that printing text on the screen is actually quite "expensive".

My test shows (the number "a")

 10 : 1221 lines of output (broadly correlated with regex steps) 11 : 1324 (+103) 12 : 1467 (+143) 13 : 1590 (+129) 14 : 1728 (+138) 15 : 1852 (+124) 20 : 2630 (approx +155/extra) 30 : 4536 (+190/extra) 40 : 6940 (+240/extra) 50 : 9846 (+290/extra) 100 - 31,846 (+440/extra letter)

So, of course, this looks like exponential behavior, but in any case, the processing time is still pretty fast, which I would attribute to a faster processor (and, possibly, better optimization of the regular expression mechanism)

Return to regexp faster than expected

More articles: