Is the regex engine skipping lines that are shorter than the pattern?

Question

Is the regex engine skipping lines that are shorter than the pattern?

I want to loop through a series of lines. On each line, I want to scroll through a set of regular expressions to determine which expressions match the line in which I am included. However, if the line length is shorter than the maximum possible line length of the template, I want the regex engine to skip it.

For example, I stop the string "abc" and test it with this regular expression.

(?i)[AZ]{3}

and it matches. Then my next test expression looks like

 (?i)[AZ]+(?=123)

Will the engine continue to learn the string from the start, even if the second case never matches?

If so, is there a way to get it to skip lines that don't meet the minimum length requirements?

+2

regex .net

erotavlas Sep 16 '15 at 19:50

source share

1 answer

Lucas trzesniewski · Accepted Answer · 2015-09-16T20:59:01+0000

When you are after the implementation details, and when the source code is available, the best way to say it is to just look at it. :)

Short answer: not really.

The optimization implemented in the implementation of the .NET regular expression is the Boyer-Moore search string as the first matching phase whenever possible . Take a look at the source code for gory details.

From the code itself:

 // The RegexBoyerMoore object precomputes the Boyer-Moore // tables for fast string scanning. These tables allow // you to scan for the first occurance of a string within // a large body of text without examining every character. // The performance of the heuristic depends on the actual // string and the text being searched, but usually, the longer // the string that is being searched for, the fewer characters // need to be examined.

This requires the binding of a prefix , the search for which is carried out by this function , the comment of which indicates:

 /* * This is the one of the only two functions that should be called from outside. * It takes a RegexTree and computes the set of chars that can start it. */

The matching algorithm contains code that immediately returns the result without matching if the input string is shorter than the computed prefix.

Note that he is also looking for bindings and optimizations for them, of course.

I did not find the minimum length optimization in the code, but I admit that I have not read it completely _{(I have to do it one day)} . But I know other regular expression implementations that do this kind of optimization (PCRE comes to mind). Anyway, the .NET implementation has its own way of optimizing things, you should rely on it.

Is the regex engine skipping lines that are shorter than the pattern?

More articles: