How exactly does this recursive regular expression work?

This is a continuation of this issue .

Take a look at this template:

(o(?1)?o) 

It corresponds to any sequence o with a length of 2 n and n โ‰ฅ 1.
It works, see regex101.com (word boundaries added for a better demonstration).
Question: Why?

In the following, the description of the string (whether it is a match or not) will simply be highlighted in bold or bold, which describes the length, for example 2 n .

Broken (with spaces added):

 ( o (?1)? o ) ( ) # Capture group 1 oo # Matches an o each at the start and the end of the group # -> the pattern matches from the outside to the inside. (?1)? # Again the regex of group 1, or nothing. # -> Again one 'o' at the start and one at the end. Or nothing. 

I do not understand why this does not match 2n , but 2 n , because I would describe the pattern as * undefined number oo , stacked in each other.

Visualization:

No recursion, 2 corresponds to:

 oo 

One recursion 4 corresponds to:

 oo oo 

So far so easy.

Two recursions. Obviously wrong, because the pattern does not match 6 :

 oo oo oo 

But why? It seems to fit the pattern.

I conclude that this is not just a simple pattern that repeats because otherwise 6 will have to be matched.

But according to regular-expressions.info :

(?P<name>[abc])(?1)(?P>name) matches three letters, such as (?P<name>[abc])[abc][abc] .

and

[abc])(?1){3} [...] is equivalent to ([abc])[abc]{3}

So it looks like it is simply relaying the regular expression code without information about the previous match of the capture group.

Can someone explain and perhaps imagine why this pattern matches 2 n and nothing more?

Edit:

The comments said:

I doubt that referring to the capture group within oneself is actually a supported case.

regular-expressions.info mentions a technique:

If you place a call inside the group that it calls, you will have a recursive capture group.

+5
source share
2 answers

You correctly understand recursion. Word boundaries do not bother you. \b around the pattern requires that the regex engine matches only the string, if it was not preceded by and accompanied by the word characters.

See how recursion goes here:

 ( o (?1)? o ) => oo 

(?1) then replaced by (o(?1)?o) :

 ( o (?>o(?1)?o)? o ) => oo or oooo 

Then again:

 (o (?>o(?>o(?1)?o)?o)? o) => oo, oooo, oooooo 

Watch the demo version of regex without word boundaries .

Why add (?>...) in the example above? Each recursion level in recursive regular expressions of PHP is atomic , unlike Perl, and as soon as the previous level fails, the engine does not return to the next.

When you add word boundaries, the first o and last o cannot have any other word characters before / after. So ooo will not match , then.

See Recursive regular expressions explained step by step and Word: \b Border at rexegg.com too.

Why is oooooo not the same as a whole, but like oooo and oo ?

Again, each recursion level is atomic. oooooo matches as follows:

  • (o(?1)?o) corresponds to the first o
  • (?1)? expands and the pattern is now (o(?>o(?1)?o)?o) and it matches the second o at the input
  • It continues until (o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o) will no longer correspond to the input, reverse tracking will occur, we will go to the 6th level,
  • The entire sixth level of recursion also fails, since it cannot match the required amount o s
  • This continues to a level that may correspond to the required amount of o s.

Talk to :

enter image description here

+4
source

This is more or less the answer to Wiktors answer - even after deleting word boundaries, it was difficult for me to understand why oooooo (6) maps to oooo and oo , and ooooooo (7) maps to oooooo .

Here's how it works in detail:

When expanding a recursive template, internal recursions are atomic. With our model, we can deploy it to

 (?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o) 

(In the actual template, this expands again, but this does not change the explanation)

And this is how the strings are matched - first oooooo (6)

 (?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o) o |ooooo <- first o gets matched by first atomic group oo |oooo <- second o accordingly ooo |ooo <- third o accordingly oooo |oo <- fourth o accordingly oooo oo| <- fifth/sixth o by the innermost atomic group ^ <- there is no more o to match, so backtracking starts - innermost ag is not matched, cursor positioned after 4th character oooo xx o |o <- fifth o matches, fourth ag is successfully matched (thus no backtracking into it) oooo xx oo| <- sixth o matches, third ag is successfully matched (thus no backtracking into it) ^ <- no more o, backtracking again - third ag can't be backtracked in, so backtracking into second ag (with matching 3rd 0 times) oo |oo<oo <- third and fourth o close second and first atomic group -> match returned (4 os) 

And now ooooooo (7)

 (?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o) o |oooooo <- first o gets matched by first atomic group oo |ooooo <- second o accordingly ooo |oooo <- third o accordingly oooo |ooo <- fourth o accordingly oooo oo|o <- fifth/sixth o by the innermost atomic group oooo oo o| <- fourth ag is matched successfully (thus no backtracking into it) ^ <- no more o, so backtracking starts here, no backtracking into fourth ag, try again 3rd ooo |ooo<o <- 3rd ag can be closed, as well as second and first -> match returned (6 os) 
+2
source

All Articles