What does this mean that a faster failure occurs with the atomic group

NOTE : - The question is a little long, as it contains a section from the book.

I read about atomic groups from Mastering Regular Expressions .

atomic groups given to result in a faster error . Quoting this particular section from the book

Faster crashes with atomic grouping. Consider ^\w+: applied to a Subject . We can see, simply by looking at him, that he will fail because there is no colon in the text, but the regular expression mechanism will not reach this conclusion until it passes through the verification movements.

So, by the time : first checked, \w+ will go to the end of the line. This leads to states - one skip me state for each \w match plus (except for the first, since plus requires one match). When at the end of the line : the failure fails, so the regex mechanism returns to the last saved state:

enter image description here

which indicates that : does not work again, this time trying to match t . This reverse path test completes, everything returns to its oldest state:

enter image description here

After trying out the final state of failure, a general failure can finally be declared. All this rollback is a big job, which after a simple glance to know that it is not necessary. If the coincidence of the gut after the last letter, it certainly does not correspond to one of the letters, which + forcibly give up!

So, knowing that none of the states remaining on \w+ , once its completion, possibly leads to a match, we can save a regular expression; the engine cannot check them: ^(?>\w+): By adding atomic grouping, we use our global knowledge of regular expression for local work \w+ by saving saved states (which we know is useless) thrown away. If there is a coincidence, the atomic grouping will not matter, but if they do not match, throwing out useless states allow the regular expression to come to this conclusion more quickly.


I tried this regex here . For ^\w+: it took 4 steps and 6 steps for ^(?>\w+): (with internal engine optimization disabled)


My questions

  • The second paragraph of the above section indicates that

So, by the time : first checked, \w+ will go to the end of the line. This leads to many states - one misses me for each \w match in plus (except for the first, since plus requires one match). Then, after the end of the line : the failure fails, so the regex mechanism returns to the last saved state:

enter image description here

which indicates that : does not work again, this time trying to match t . This reverse path test completes, everything returns to its oldest state:

enter image description here

but to this site, I do not see a return. What for?

Is there any kind of optimization inside (even after turning it off)?

  1. Can the number of steps performed by a regular expression decide whether one regular expression has good performance compared to another regular expression?
+7
regex pcre
source share
1 answer

The debugger on this site seems to mask the return details. RegexBuddy does a better job. Here is what it shows for ^\w+:

Normal (greedy)

After \w+ consumes all the letters, it tries to match : and fails. Then it returns one character, tries again : and works again. And so on, until there is nothing left to give. Only fifteen. Now consider the atomic version ( ^(?>\w+): :

Atomic

After an unsuccessful match with : for the first time, it returns all the letters at once, as if they were a single character. Only five steps, and two of them enter and leave the group. And using possessive quantifiers ( ^\w++: eliminates even those:

Possessive

As for your second question, yes, a measure of the number of steps from regular expression debuggers is useful, especially if you're just learning regular expressions. Each regex has at least a few optimizations that even poorly written regexes can perform adequately, but a debugger (especially taste-neutral like RegexBuddy's) makes it obvious when you do something wrong.

+3
source share

All Articles