Is alternation faster than subsequent replacements in regular expressions

Question

Is alternation faster than subsequent replacements in regular expressions

I have a pretty simple question. Where I work, I see a lot of regular expressions. They are used in Perl to replace and / or eliminate some lines in the text, for example:

$string=~s/^.+\///; $string=~s/\.shtml//; $string=~s/^ph//;

I understand that you cannot combine the first and last replacements because you may want to replace ph at the beginning of the line after the first change. However, I would put the first and second regular expression along with alternation: $string=~s/(^.+\/|\.shtml)//; Since we process thousands of files (+500,000), I was wondering which method is most efficient.

+6

regex perl alternation

Bram vanroy Apr 05 '16 at 8:06

source share

6 answers

How regular expression interlacing is implemented in Perl is pretty well explained in perldoc perlre

Compliance with one or another
We can match different character strings with alternating metacharacter '|' . To match dog or cat , we form the regular expression dog|cat . As before, Perl will try to match the regex as soon as possible to a dot in the string. At each character position, Perl will first try to match the first alternative to dog . If dog does not match, Perl will then try the next alternative, cat . If cat does not match, then the match fails, and Perl moves to the next position in the line. Some examples:
 "cats and dogs" =~ /cat|dog|bird/; # matches "cat" "cats and dogs" =~ /dog|cat|bird/; # matches "cat" 
Even if dog is the first alternative in the second regular expression, cat may match earlier in the string.
 "cats" =~ /c|ca|cat|cats/; # matches "c" "cats" =~ /cats|cat|ca|c/; # matches "cats" 
Here, all alternatives correspond to the first position of the line, so the first alternative is the one that matches. If some of the alternatives are truncations of others, put the longest ones first to give them a chance to match.
 "cab" =~ /a|b|c/ # matches "c" # /a|b|c/ == /[abc]/ 
The final example indicates that character classes are similar to character interleaving. Given a character position, the first alternative that allows you to combine a regular expression for success will match the one that matches.

So this should explain the price you pay when using regular rotation.

When combining a simple regular expression, you do not pay that price. This is well explained in another related question in SO. When directly searching for a constant string or character set, as in the question, optimizations can be made and no backtracking is required, which means faster code.

When defining alternation of regular expressions, simply choosing a good order (first finding the most common results) can affect performance. Not that you choose one of two options, or twenty. As always, premature optimization is the root of all evil, and you should measure the code ( Devel :: NYTProf ) if there are problems or you want improvements. But, as a rule, alternation should be minimized and, if possible, avoided:

They easily make the regular expression too large. We like the simple, easy to understand / debugging / supporting regular expression.
Variability and dependence of input. They can be an unexpected source of problems because they are backtrack and can lead to an unexpected lack of performance depending on your input. As far as I understand, there is no case when they will be faster.
Conceptually, you are trying to compare two different things, so we can say that two different operators are more correct and understandable than just one.

Hope this answer comes close to what you expected.

+3

eballes Apr 13 '16 at 20:10

source share

First, measure the various parameters of your real data, because no theory will beat the experiment (if this can be done). There are many synchronization modules in CPAN that will help you.

Secondly, if you decide to optimize regular expressions, do not crush them into one giant monster manually, try to collect the "main" regular expression with code. Otherwise, no one can decrypt the code.

+1

zoul Apr 05 '16 at 8:15

source share

The combination is not the best option

If you have three regular expressions that work perfectly, there is no use to combining them. Not only do they rewrite them, open the door for errors, it makes reading and programming a programmer and engine difficult.

This page assumes that you need to change instead:

 while (<FILE>) { next if (/^(?:select|update|drop|insert|alter)\b/); ... }

You should use:

 while (<FILE>) { next if (/^select/); next if (/^update/); ... }

You can optimize regular expressions

You can use regex objects that ensure that your regex is not recompiled in a loop:

 my $query = qr/foo$bar/; #compiles here @matches = ( ); ... while (<FILE>) { push @matches, $_ if /$query/i; }

You can also optimize .+ . It eats up the whole file, and then should return the character by character until it finds / so that it can match. If there is only one / in the file, try a negative character class, for example: [^/] (escaped: [^\/] ). Where do you expect to find / in your file? Knowing this will make your regular expression faster.

Substitution rate depends on other factors.

If you have performance issues (currently with 3 regular expressions), this may be another part of your program. While computer processing speeds increased exponentially, read and write speeds increased slightly.

There may be faster mechanisms for finding and replacing text in a file.

Perl uses NFA, which is slower but more powerful than the DFA processor. NFA (especially with changes) and has the worst exponential time period. DFA has linear runtime. Your templates do not need an NFA engine, so you can use your regular expressions in a DFA engine, such as sed, very easily.

According to this, sed can search and replace at a rate of 82.1 million characters processed per second (note that this test was written to /dev/null , so the speed of writing to the hard drive was not really a factor).

+1

Laurel Apr 14 '16 at 17:39

source share

Maybe something is off topic, but if actual replacements are rare, compared to the number of compares (10% -20%?), You can get some speed using the first index match

 $string=~s/\.shtml// if index($string, ".shtml");

0

Ftlie Apr 16 '16 at 10:50

source share

The second method is best when you put the first and second regular expression along with alternation. Because in this method, perl performs a one-time move and validates both expressions.

If you use the first method, in which perl should go separately for both expressions.

Consequently, the number of cycles decreased in the second method.

-2

mukesh bhoj Apr 05 '16 at 9:29

source share

ThisSuitIsBlackNot · Accepted Answer · 2016-04-13T23:05:39+0000

Your expressions are not equivalent

It:

 $string=~s/^.+\///; $string=~s/\.shtml//;

replaces the text .shtml and all the way to the last slash.

It:

 $string=~s/(^.+\/|\.shtml)//;

replaces the text .shtml or all, up to the last slash.

This is one problem with combining regular expressions: a complex complex regular expression is harder to write, harder to understand, and harder to debug than a few simple ones.

It probably doesn't matter which is faster

Even if your expressions are equivalent, using one or the other will probably not have a significant impact on the speed of your program. In-memory operations, such as s/// , are significantly faster than I / O files, and you indicated that you do a lot of file I / O.

You must profile your application with Devel :: NYTProf to see if these specific replacements are really a bottleneck (I doubt they exist). Do not waste time optimizing things that are already fast.

Alternatives discourage optimizer

Keep in mind that you are comparing apples and oranges, but if you are still interested in learning about performance, you can see how perl evaluates a certain regular expression using re pragma :

 $ perl -Mre=debug -e'$_ = "foobar"; s/^.+\///; s/\.shtml//;' ... Guessing start of match in sv for REx "^.+/" against "foobar" Did not find floating substr "/"... Match rejected by optimizer Guessing start of match in sv for REx "\.shtml" against "foobar" Did not find anchored substr ".shtml"... Match rejected by optimizer Freeing REx: "^.+/" Freeing REx: "\.shtml"

The regex engine has an optimizer. The optimizer looks for substrings that should be displayed in the target string; if these substrings cannot be found, the match is not executed immediately without checking the other parts of the regular expression.

With /^.+\// optimizer knows that $string must contain at least one forward slash to match; when he does not find any slashes, he immediately rejects the match, without invoking the full mechanism of regular expressions. A similar optimization occurs with /\.shtml/ .

Here is what perl does with combined regex:

 $ perl -Mre=debug -e'$_ = "foobar"; s/(?:^.+\/|\.shtml)//;' ... Matching REx "(?:^.+/|\.shtml)" against "foobar" 0 <> <foobar> | 1:BRANCH(7) 0 <> <foobar> | 2: BOL(3) 0 <> <foobar> | 3: PLUS(5) REG_ANY can match 6 times out of 2147483647... failed... 0 <> <foobar> | 7:BRANCH(11) 0 <> <foobar> | 8: EXACT <.shtml>(12) failed... BRANCH failed... 1 <f> <oobar> | 1:BRANCH(7) 1 <f> <oobar> | 2: BOL(3) failed... 1 <f> <oobar> | 7:BRANCH(11) 1 <f> <oobar> | 8: EXACT <.shtml>(12) failed... BRANCH failed... 2 <fo> <obar> | 1:BRANCH(7) 2 <fo> <obar> | 2: BOL(3) failed... 2 <fo> <obar> | 7:BRANCH(11) 2 <fo> <obar> | 8: EXACT <.shtml>(12) failed... BRANCH failed... 3 <foo> <bar> | 1:BRANCH(7) 3 <foo> <bar> | 2: BOL(3) failed... 3 <foo> <bar> | 7:BRANCH(11) 3 <foo> <bar> | 8: EXACT <.shtml>(12) failed... BRANCH failed... 4 <foob> <ar> | 1:BRANCH(7) 4 <foob> <ar> | 2: BOL(3) failed... 4 <foob> <ar> | 7:BRANCH(11) 4 <foob> <ar> | 8: EXACT <.shtml>(12) failed... BRANCH failed... 5 <fooba> <r> | 1:BRANCH(7) 5 <fooba> <r> | 2: BOL(3) failed... 5 <fooba> <r> | 7:BRANCH(11) 5 <fooba> <r> | 8: EXACT <.shtml>(12) failed... BRANCH failed... Match failed Freeing REx: "(?:^.+/|\.shtml)"

Notice how much longer the output. Due to interleaving, the optimizer does not start, and the full regular expression mechanism is executed. In the worst case (no match), each part of the rotation is checked for every character in the string. It is not very effective.

So alternation is slower, isn't it? No, because...

It depends on your data.

Again, we compare apples and oranges, but with:

 $string = 'a/really_long_string';

combined regex can be faster because with s/\.shtml// optimizer needs to scan most of the string before rejecting the match, while combined regex will quickly match.

You can benchmark this for fun, but it is practically pointless, as you are comparing different things.

Is alternation faster than subsequent replacements in regular expressions

Your expressions are not equivalent

It probably doesn't matter which is faster

Alternatives discourage optimizer

It depends on your data.

The combination is not the best option

You can optimize regular expressions

Substitution rate depends on other factors.

There may be faster mechanisms for finding and replacing text in a file.

More articles: