How can I speed up Perl regex matching?

I want to write multiple texts using the following regular expression:

$text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$}; 

A sample string is as follows:

 my $text = '/F12345 FF FF this is SCF SF really MV (important stuff SH'; 

Can I rewrite to speed up the comparison?

+4
source share
5 answers

There is no single answer to regex optimization. You can see what a particular regex does with re pragma:

  use re 'debugcolor'; 

As soon as you see that it crosses the line, you see where it has problems, and adjust your regular expression. You will learn a little about the regex engine, how you do it.

You should also check out Mastering Regular Expressions , which explains how regular expressions work and why some patterns are slower than others.

+17
source

This is very dependent on the profile of the data you are viewing.

What you can do is identify the part of your RE that filters out most of the input and make a separate simpler RE for this expression.

For example, if only 5% of the entered date contains the string 'MV' you can filter for this in the first place and apply only a more complex RE, if a simpler one is true

So you will have:

 if ( $text_normal =~ / MV / ) { $text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$}; if ....... } } 
+6
source

Not seeing some example data, it's hard to say.

Avoiding use is generally recommended .* . Look for all possible sources of unnecessary reverse tracking and eliminate them.

You can get away with split with a slice if your needs are simple.

  my @vals = (split / /, $string)[0,2,5,7]; 
+6
source

(.*) means that you are dealing with any number of repetitions of “SCF SF” before you find the one that indicates the next capture, making it inanimate, you are still processing the possibilities that even “SCF SF 'will appear in the capture after 'FF.' I think you handle a lot of cases that you don't need.

The best way to optimize a regular expression sometimes makes it more cryptic - but you definitely find ways to make the expression fail earlier. (.*?) without being "greedy" is definitely too tolerant.

Below is a more detailed, but faster alternative to the second capture.

 ((?:[^S]|S[^C]|SC[^F]|SCF[^ ]|SCF [^S]|SCF S[^F])*) 

But you can optimize it even more if you think that the line \bSCF\b should automatically commit and expect only "\ bSCF SF \ b". So you can rewrite this as:

 ((?:[^S]|S[^C]|SC[^F]SCF\B)*) SCF SF 

But you can optimize these lines even more by controlling backtracking. If you think that in the world there is no way for SCF to ever appear as a word, and it is not followed by SF at the actual input. To do this, you add another group to it with brackets (?> And ) .

 (?>((?:[^S]|S[^C]|SC[^F]SCF\B)*)) SCF SF 

This means that the logic of correspondence will in no way try to overestimate what it has captured. If the characters after that are not “SCF SF”, the whole expression stops working. And that ends long before he ever tries to place MV and other subexpressions.

In fact, given certain expressions about the uniqueness of delimiters, the highest performance for this expression will be:

 $text_normal = qr{^(\/F\d+) FF (?>((?:[^S]|S[^C]|SC[^F]SCF\B)*))SCF SF (?>((?:[^M]|M[^V]|MV\B)*))MV (?>(\((?:[^S]|S[^H]|SH.)*))SH$}; 

In addition, detailed, exhaustive negative matches may be alternative, pronounced negative images - but I don't know how this works for performance. But a negative look will work as follows:

 ((?:.(?! SCF))*) SCF SF 

This means that for this capture, I need any character that is not a space starting with the string "SCF SF".

+2
source

Backtracking is one of the most reliable ways to kill the performance of regular expressions, but unfortunately it does not look like you can completely eliminate the wildcard . in favor of character classes, if only the text is missing "Capture is forbidden to contain uppercase characters. (If this prohibition exists, you can replace .*? say, [az ]* . instead). You can still reduce the chance of a return using {} to set the minimum / maximum number of characters to match, for example, as .{0,10}? if the match cannot be more than 10 characters.

0
source

All Articles