Every regex flavor I know numbers groups in the order in which their opening parentheses appear. That outer groups are numbered before the subgroups they contain is just a natural consequence, not an explicit policy.
Where it gets interesting is with named groups. In most flavors they follow the same policy of numbering by the relative position of the parentheses; the name is merely an alias for the number. In .NET regexes, however, named groups are numbered separately from unnamed groups. For example:
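To make the ordering concrete, here is a minimal sketch using Python's `re` module, which is just one flavor I picked for illustration; the pattern and sample string are made up for the example.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# the outer group opens first, so it is group 1; the two groups nested
# inside it are 2 and 3; the trailing group is 4.
pattern = re.compile(r"((\w+) (\w+)) (\w+)")
match = pattern.match("one two three")

# groups() returns the captures in group-number order.
numbered = match.groups()
print(numbered)
```

The outer group comes out first simply because its parenthesis is the first one encountered, not because of any special rule about nesting.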
Regex.Replace(@"one two three four", @"(?<one>\w+) (\w+) (?<three>\w+) (\w+)", @"$1 $2 $3 $4") // result: "two four one three"
In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there is a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like this one from another thread, which matches floating-point numbers from different locales:
^[+-]?[0-9]{1,3} (?: (?:(?<thousand>\,)[0-9]{3})* (?:(?<decimal>\.)[0-9]{2})? | (?:(?<thousand>\.)[0-9]{3})* (?:(?<decimal>\,)[0-9]{2})? | [0-9]* (?:(?<decimal>[\.\,])[0-9]{2})? )$
If there is a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient that I think it more than justifies the weird numbering scheme.
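For comparison, here is a rough sketch of one of those other ways, in a flavor that does not allow a group name to be reused (Python's `re`, chosen only for illustration). The suffixed group names and the `separators` helper are hypothetical, and I am assuming the spaces in the pattern above are layout only, hence `re.VERBOSE`.

```python
import re

# Same structure as the .NET pattern, but every group needs a unique
# name, so each alternative gets its own suffix (_a, _b, _c).
NUMBER = re.compile(r"""
    ^[+-]?[0-9]{1,3}
    (?:
        (?:(?P<thousand_a>,)[0-9]{3})*  (?:(?P<decimal_a>\.)[0-9]{2})?
      |
        (?:(?P<thousand_b>\.)[0-9]{3})* (?:(?P<decimal_b>,)[0-9]{2})?
      |
        [0-9]* (?:(?P<decimal_c>[.,])[0-9]{2})?
    )$
""", re.VERBOSE)

def separators(text):
    """Return (thousands separator, decimal separator), or None if no match."""
    m = NUMBER.match(text)
    if m is None:
        return None
    g = m.groupdict()
    # Only one alternative can have participated, so take whichever
    # of the parallel groups actually captured something.
    thousand = g["thousand_a"] or g["thousand_b"]
    decimal = g["decimal_a"] or g["decimal_b"] or g["decimal_c"]
    return thousand, decimal

print(separators("1,234.56"))
print(separators("1.234,56"))
```

The extra bookkeeping of merging the parallel groups by hand is exactly what the reusable names in .NET save you.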
And then there is Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D
Alan Moore, Aug 21 '09 at 21:43