Java Matcher Groups: Understanding The Difference Between "(?: X | Y)" and "(?: X) | (?: Y)"

Can anyone explain:

  • Why do the two patterns used below give different results?
  • Why does the second example report a group count of 1, yet give -1 for both the start and the end of group 1?

    public void testGroups() throws Exception {
        String TEST_STRING = "After Yes is group 1 End";
        {
            Pattern p;
            Matcher m;
            String pattern = "(?:Yes|No)(.*)End";
            p = Pattern.compile(pattern);
            m = p.matcher(TEST_STRING);
            boolean f = m.find();
            int count = m.groupCount();
            int start = m.start(1);
            int end = m.end(1);
            System.out.println("Pattern=" + pattern + "\t Found=" + f
                    + " Group count=" + count
                    + " Start of group 1=" + start
                    + " End of group 1=" + end);
        }
        {
            Pattern p;
            Matcher m;
            String pattern = "(?:Yes)|(?:No)(.*)End";
            p = Pattern.compile(pattern);
            m = p.matcher(TEST_STRING);
            boolean f = m.find();
            int count = m.groupCount();
            int start = m.start(1);
            int end = m.end(1);
            System.out.println("Pattern=" + pattern + "\t Found=" + f
                    + " Group count=" + count
                    + " Start of group 1=" + start
                    + " End of group 1=" + end);
        }
    }

Which gives the following output:

    Pattern=(?:Yes|No)(.*)End      Found=true Group count=1 Start of group 1=9 End of group 1=21
    Pattern=(?:Yes)|(?:No)(.*)End  Found=true Group count=1 Start of group 1=-1 End of group 1=-1
+7
java regex regex-group
4 answers

Summarizing,

1) These two patterns give different results due to operator precedence rules.

  • (?:Yes|No)(.*)End matches (Yes or No), followed by .*, followed by End
  • (?:Yes)|(?:No)(.*)End matches either (Yes) or (No followed by .* followed by End)
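
One way to see the precedence difference directly is to print the overall match (group 0) for each pattern. The following is only a minimal sketch; the class name PrecedenceDemo and the helper firstMatch are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrecedenceDemo {
    // Hypothetical helper: returns the text of the first overall match (group 0),
    // or null if the pattern is not found anywhere in the input.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";
        // Alternation confined to the first group: (Yes or No), then .*, then End.
        System.out.println(firstMatch("(?:Yes|No)(.*)End", input));
        // Top-level alternation: Yes on its own, or No followed by .* and End.
        System.out.println(firstMatch("(?:Yes)|(?:No)(.*)End", input));
    }
}
```

With the test string from the question, the first call should return the text from "Yes" through "End", while the second should return just "Yes", because the first alternative already succeeds on its own.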

2) The second pattern reports a group count of 1 but a start and end of -1 because of the (not necessarily intuitive) meaning of the values returned by the Matcher methods.

  • Matcher.find() returns true if a match is found. In your case, the match was on the (?:Yes) part of the pattern.
  • Matcher.groupCount() returns the number of capturing groups in the pattern, regardless of whether those groups actually participated in the match. In your case, only the non-capturing (?:Yes) part of the pattern participated in the match, but the capturing group (.*) is still part of the pattern, so the group count is 1.
  • Matcher.start(n) and Matcher.end(n) return the start and end index of the subsequence captured by the nth capturing group. In your case, although an overall match was found, the capturing group (.*) did not participate in the match and therefore captured no subsequence, so both results are -1.

3) (Asked in a comment.) To determine how many capturing groups actually captured a subsequence, call Matcher.start(n) for n from 1 to Matcher.groupCount() and count the results that are not -1. (Group 0 denotes the entire match rather than a capturing group and is not included in groupCount(), so the loop starts at 1.)
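
The counting described in point 3 could look roughly like this. It is a sketch only: ParticipatingGroups, countParticipating and countFor are illustrative names, and returning -1 when nothing matches is an arbitrary convention chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParticipatingGroups {
    // Counts the capturing groups (1..groupCount()) that took part in the most
    // recent match. Assumes find() or matches() has already succeeded on m.
    static int countParticipating(Matcher m) {
        int n = 0;
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.start(i) != -1) {
                n++;
            }
        }
        return n;
    }

    // Convenience wrapper for the example; -1 here means "no match at all"
    // (an assumption of this sketch, not a java.util.regex convention).
    static int countFor(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return -1;
        }
        return countParticipating(m);
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";
        System.out.println(countFor("(?:Yes|No)(.*)End", input));
        System.out.println(countFor("(?:Yes)|(?:No)(.*)End", input));
    }
}
```

For the two patterns in the question this should report one participating group for the first pattern and none for the second, even though groupCount() is 1 for both.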

+4
  • The difference is that in the second pattern, "(?:Yes)|(?:No)(.*)End", concatenation ("X followed by Y" in "XY") takes precedence over alternation ("either X or Y" in "X|Y"), just as multiplication takes precedence over addition, so the pattern is equivalent to

     "(?:Yes)|(?:(?:No)(.*)End)" 

    What you wanted is the following pattern:

     "(?:(?:Yes)|(?:No))(.*)End" 

    This gives the same result as your first pattern.

    In your test, the second pattern has group 1 in the (empty) range [-1, -1[, because this group does not match (the start -1 is inclusive and the end -1 is exclusive, which makes the half-open interval empty).

  • A capturing group is a group that can capture input. If it captures, it is also said to match some substring of the input. If the regular expression contains an alternation, not every capturing group necessarily captures input, so there may be groups that do not match even though the regular expression as a whole does.

  • The number of groups returned by Matcher.groupCount() is obtained purely by counting the parentheses of the capturing groups, regardless of whether any of them matches any input. Your pattern has exactly one capturing group: (.*). This is group 1. The documentation states:

     (?:X) X, as a non-capturing group 

    and explains:

    Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing group.

    Whether a particular group matches a given input does not matter for this definition. For example, in the pattern (Yes)|(No) there are two groups ((Yes) is group 1, (No) is group 2), but only one of them can match any given input.

  • A call to Matcher.find() returns true if the regular expression matched some substring. You can determine which groups matched by looking at their start: if it is -1, the group did not match, and in that case its end is -1 as well. There is no built-in method that tells you how many capturing groups actually matched after a call to find() or matches(). You have to count them yourself by looking at the start of each group.

  • When it comes to backreferences, note what the regex tutorial has to say:

    There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that did not participate in the match at all.
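
The points above about groups that do not participate can be tried out with a short sketch. The class and method names (GroupParticipationDemo, describe, fullMatch) are invented for this example, and the patterns (q?)b\1 and (q)?b\1 are illustrative choices, not taken from the question. The backreference part relies on the assumption that java.util.regex makes a backreference to a non-participating group fail rather than match the empty string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupParticipationDemo {
    // Reports the group count and the text of groups 1 and 2 for (Yes)|(No).
    static String describe(String input) {
        Matcher m = Pattern.compile("(Yes)|(No)").matcher(input);
        if (!m.find()) {
            return "no match";
        }
        // group(n) returns null for a group that did not participate in the match.
        return "groupCount=" + m.groupCount()
                + " group1=" + m.group(1)
                + " group2=" + m.group(2);
    }

    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Two capturing groups in the pattern, only one takes part per match.
        System.out.println(describe("Yes"));
        System.out.println(describe("No"));

        // (q?) always participates; on "b" it captures the empty string,
        // so \1 has an (empty) capture to refer back to.
        System.out.println(fullMatch("(q?)b\\1", "b"));
        // (q)? is skipped entirely on "b", so group 1 never participates.
        System.out.println(fullMatch("(q)?b\\1", "b"));
    }
}
```

In both describe calls the group count stays at 2 while one of the two groups comes back as null. The two fullMatch calls differ only in whether the optional part sits inside or outside the capturing parentheses, which is exactly the distinction the tutorial quote draws.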

+7

Because of the precedence of the "|" operator, the second pattern is equivalent to:

 (?:Yes)|(?:(?:No)(.*)End) 

What you want is:

 (?:(?:Yes)|(?:No))(.*)End 
+3

When using a regular expression, it is important to remember that there is an implicit AND (concatenation) operator at work. This can be seen in the JavaDoc for java.util.regex.Pattern, in the section covering the logical operators:

Logical operators
XY     X followed by Y
X|Y    Either X or Y
(X)    X, as a capturing group

This AND takes precedence over OR in the second pattern, so the second pattern is equivalent to (?:Yes)|(?:(?:No)(.*)End).
For it to be equivalent to the first pattern, it must be changed to
(?:(?:Yes)|(?:No))(.*)End
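
The equivalences claimed here can be checked by comparing the span of group 1 for all four spellings on two inputs. This is a sketch under assumptions: GroupingEquivalence and groupOneSpan are illustrative names, and the second input string (with "No" instead of "Yes") is added for the example and does not appear in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingEquivalence {
    // Hypothetical helper: returns "start,end" of group 1 for the first match,
    // or "no match" if find() fails.
    static String groupOneSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.start(1) + "," + m.end(1);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "After Yes is group 1 End",   // test string from the question
            "After No is group 1 End"     // variant added for this sketch
        };
        String[] patterns = {
            "(?:Yes|No)(.*)End",          // first pattern from the question
            "(?:(?:Yes)|(?:No))(.*)End",  // second pattern with the alternation grouped
            "(?:Yes)|(?:No)(.*)End",      // second pattern from the question
            "(?:Yes)|(?:(?:No)(.*)End)"   // same, with the implicit grouping spelled out
        };
        for (String input : inputs) {
            for (String p : patterns) {
                System.out.println(p + " on \"" + input + "\" -> " + groupOneSpan(p, input));
            }
        }
    }
}
```

The first two patterns should agree with each other on both inputs, and so should the last two. On the "No" input all four capture the same span, since the second alternative of the ungrouped pattern is then the one that matches.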

+1
