How are nested capture groups numbered in regular expressions?

Question

Is there a defined behavior for how regular expressions should handle capturing with nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position and the nested parentheses in subsequent positions?

Consider the following PHP code (using PCRE regular expressions):

 <?php
 $test_string = 'I want to test sub patterns';
 preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
 print_r($matches);
 ?>

 Array
 (
     [0] => I want to test sub patterns  //entire pattern
     [1] => I want to test               //entire outer parenthesis
     [2] => want                         //first inner
     [3] => to                           //second inner
     [4] => patterns                     //next parentheses set
 )

The first parenthesized expression is captured first ("I want to test"), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for capturing the inner parentheses first, and THEN capturing the entire pattern.

So, is this "capture the entire thing first" a defined behavior in regular expression engines, or will it depend on the context of the pattern and/or the behavior of the engine (PCRE being different from C#, being different from Java, etc.)?

+54

java language-agnostic regex .net perl

Alan Storm Aug 21 '09 at 19:54


4 answers

Yes, this is all pretty well defined for all the languages you're interested in:

Java - http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
"Capturing groups are numbered by counting their opening parentheses from left to right. Group zero always stands for the entire expression."
.Net - http://msdn.microsoft.com/en-us/library/bs2twtah(VS.71).aspx
"Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern."
PHP (PCRE functions) - http://www.php.net/manual/en/function.preg-replace.php#function.preg-replace.parameters
"\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern." (This is also true of the legacy POSIX functions.)

PCRE - http://www.pcre.org/pcre.txt
To add to what Alan M said, search for "How pcre_exec() returns captured substrings" and read the fifth paragraph that follows:

 The first pair of integers, ovector[0] and ovector[1], identify the
 portion of the subject string matched by the entire pattern.  The next
 pair is used for the first capturing subpattern, and so on.  The value
 returned by pcre_exec() is one more than the highest numbered pair that
 has been set.  For example, if two substrings have been captured, the
 returned value is 3. If there are no capturing subpatterns, the return
 value from a successful match is 1, indicating that just the first pair
 of offsets has been set.

Perl is different - http://perldoc.perl.org/perlre.html#Capture-buffers
$1, $2, etc. match capture groups as you might expect (i.e., by occurrence of the opening parenthesis). However, $0 returns the program name, not the entire matched string, so you use $& for that instead.

You will most likely find similar results for other languages (Python, Ruby, etc.).
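As a quick check of that claim for one more engine, here is a minimal sketch using Python's standard `re` module with the pattern and subject string from the question. It only illustrates Python's behavior; the variable names are chosen for this example.

```python
import re

# Same pattern and subject string as the PHP example in the question.
pattern = re.compile(r"(I (want) (to) test) sub (patterns)")
match = pattern.match("I want to test sub patterns")

# group(0) is the whole match; groups 1..n follow the order of the
# opening parentheses, so the outer group comes before its inner groups.
for number in range(pattern.groups + 1):
    print(number, repr(match.group(number)))
```

The loop lists the whole match first, then "I want to test", "want", "to" and "patterns", which is the same order as the PHP output in the question.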

You note that it's equally logical to have the inner capture groups come first, and you're right - it's just a matter of indexing on closing, rather than opening, parens (if I understand you correctly). Doing this is less natural, though (for example, it doesn't follow the reading-direction convention), and so makes it more difficult (probably not significantly) to determine, by inspection, which capture group will be at a given result index.
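To make the two orderings concrete, here is a small sketch that lists the groups of a pattern once by opening parenthesis and once by closing parenthesis. The helper `group_orders` is hypothetical and written only for this illustration; it assumes every parenthesis in the pattern is a plain capturing group (no escapes, character classes, or `(?...)` constructs), so it is not a general regex parser.

```python
def group_orders(pattern):
    """Return (groups ordered by opening paren, groups ordered by closing paren).

    Hypothetical helper for illustration. Assumes every '(' and ')' in
    `pattern` delimits a plain capturing group.
    """
    stack = []      # start indices of groups that are still open
    opens = []      # start indices in the order the groups were opened
    texts = {}      # start index -> source text of the finished group
    by_close = []   # group source texts in the order the groups were closed
    for i, ch in enumerate(pattern):
        if ch == "(":
            stack.append(i)
            opens.append(i)
        elif ch == ")":
            start = stack.pop()
            texts[start] = pattern[start:i + 1]
            by_close.append(texts[start])
    by_open = [texts[start] for start in opens]
    return by_open, by_close


by_open, by_close = group_orders("(I (want) (to) test) sub (patterns)")
print(by_open)   # the numbering that regex engines actually use
print(by_close)  # the alternative ordering raised in the question
```

With the pattern from the question, ordering by opening parenthesis puts the outer group first, while ordering by closing parenthesis would put "(want)" and "(to)" ahead of it.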

Putting the entire matched string at position 0 also makes sense - mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number of capture groups from regex to regex, and regardless of the number of capture groups that actually match anything (Java, for example, will collapse the length of the matched-groups array for each capture group that does not match any content; think of something like "a (.*)pattern"). You could always inspect capturing_group_results[capturing_group_results_length - 2], but that doesn't translate well to languages like Perl that dynamically create variables ($1, $2, etc.). (Perl is a bad example, of course, since it uses $& for the matched expression, but you get the idea :).
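As one data point on how an engine can keep indices stable, here is a sketch of what Python's `re` module does when an optional group takes no part in the match. This shows Python only and says nothing about how other engines report such groups; the pattern is made up for the example.

```python
import re

# Group 1 is optional, group 2 is required.
pattern = re.compile(r"a(b)?(c)")

with_b = pattern.match("abc")
without_b = pattern.match("ac")

# The whole match stays at index 0 and group 2 stays at index 2,
# whether or not group 1 took part in the match.
print(with_b.group(0), with_b.groups())
print(without_b.group(0), without_b.groups())
```

In Python a group that did not participate is reported as `None`, so the numbering of the remaining groups does not shift.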

+14

Alan Donnelly Aug 22 '09 at 3:31


Every regex flavor I know numbers groups in the order in which the opening parentheses appear. That outer groups are numbered before the subgroups they contain is just a natural outcome, not an explicit policy.

Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens; the name is merely an alias for the number. However, in .NET regexes, named groups are numbered separately from numbered groups. For example:

 Regex.Replace(@"one two three four", @"(?<one>\w+) (\w+) (?<three>\w+) (\w+)", @"$1 $2 $3 $4") // result: "two four one three"

In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there is a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:

 ^[+-]?[0-9]{1,3}
 (?:
     (?:(?<thousand>\,)[0-9]{3})*
     (?:(?<decimal>\.)[0-9]{2})?
 |
     (?:(?<thousand>\.)[0-9]{3})*
     (?:(?<decimal>\,)[0-9]{2})?
 |
     [0-9]*
     (?:(?<decimal>[\.\,])[0-9]{2})?
 )$

If there is a thousands separator, it will be saved in the "thousand" group no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in the "decimal" group. Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient that I think it more than justifies the weird numbering scheme.

And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D
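For contrast with the .NET numbering described above, here is a sketch of the more common policy, using Python's `re` module as an example of a flavor where a name is only an alias for the group's position. The regex and subject string are the ones from the .NET example; the Python named-group syntax `(?P<name>...)` is the only change.

```python
import re

pattern = re.compile(r"(?P<one>\w+) (\w+) (?P<three>\w+) (\w+)")
match = pattern.match("one two three four")

# Named and unnamed groups share one numbering, by opening parenthesis.
numbered = match.groups()

# groupindex maps each group name to the number it is an alias for.
name_to_number = dict(pattern.groupindex)

print(numbered)
print(name_to_number)
```

Here the group named "three" is simply group 3, so a replacement using groups 1 through 4 in order would leave the words in their original order, unlike the .NET result shown above.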

+8

Alan Moore Aug 21 '09 at 21:43


Capturing in order of the left paren is standard for all the platforms I've worked on (Perl, PHP, Ruby, egrep).

+4

Devin Ceartas Aug 21 '09 at 19:57


daotoad · Accepted Answer · 2009-08-21 20:00

From perlrequick

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.

Update

I don't use PCRE much, as I generally use the real thing ;), but the PCRE docs show the same thing as Perl's:

SUBPATTERNS
2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain the numbers for the capturing subpatterns.
For example, if the string "the red king" is matched against the pattern
 the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
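The same example can be tried in Python's `re` module, which is a different engine from PCRE but numbers the groups the same way. The sketch below also collects the (start, end) offsets of each group, which is loosely comparable to the offset pairs PCRE reports through ovector; that comparison is an analogy, not a description of the PCRE API.

```python
import re

match = re.search(r"the ((red|white) (king|queen))", "the red king")

# One (start, end) pair per group; index 0 is the whole match.
offsets = [match.span(n) for n in range(match.re.groups + 1)]

for number, (start, end) in enumerate(offsets):
    print(number, (start, end), repr(match.string[start:end]))
```

Group 1 spans "red king", and groups 2 and 3 are the two groups nested inside it, in left-to-right order of their opening parentheses.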

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined: "Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.
