Zero-length matches in Java regex

My code is:

    Pattern pattern = Pattern.compile("a?");
    Matcher matcher = pattern.matcher("ababa");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output:

    0[a]1
    1[]1
    2[a]3
    3[]3
    4[a]5
    5[]5

What I know:

  • "a?" denotes zero or one occurrence of the character 'a'.

The Java API says:

  • matcher.start() returns the start index of the previous match.
  • matcher.end() returns the offset after the last character matched.
  • matcher.group() returns the input subsequence matched by the previous match. For a matcher m with input sequence s, the expressions m.group() and s.substring(m.start(), m.end()) are equivalent. Note that some patterns, for example a*, match the empty string. This method returns the empty string when the pattern successfully matches the empty string in the input.

What I want to know:

  • In which situations does the regex engine encounter zero occurrences of the given character(s), here the character 'a'?
  • In such a situation, which values do start(), end() and group() actually return? I have quoted what the Java API says, but I do not quite understand how it applies to the practical case shown above.
2 answers

? is a greedy quantifier, so it first tries to match one occurrence before falling back to zero occurrences. On your string, the engine:

  • starts at the first character, 'a', and tries to match one occurrence. The 'a' matches, so it returns the first result you see.
  • then advances and finds 'b'. The 'b' does not match one occurrence, so the engine backs off and tries zero occurrences. The empty string matches, which gives you your second result.
  • then moves past the 'b', since nothing more can match at that position, and starts again at your second 'a'.
  • and so on. You get the idea.

The real mechanics are a little trickier, but this is the main idea: when one occurrence cannot match, the engine tries zero occurrences.

As for the values of start(), end() and group(): start and end are where the match begins and ends, and the group is what was matched. So for the first zero-occurrence match in your string, you get 1, 1 and the empty string. I am not sure if this really answers your question.
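The walkthrough above can be reproduced with a small sketch. The class name ZeroLengthTrace and the helper trace are made up for illustration; only Pattern and Matcher come from java.util.regex. For contrast it also runs the reluctant form a??, which prefers zero occurrences, so every match it reports should be empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthTrace {
    // Hypothetical helper: returns every match as "start[group]end",
    // the same format the question prints.
    static List<String> trace(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "[" + m.group() + "]" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy: tries one 'a' first, falls back to the empty string.
        // Prints [0[a]1, 1[]1, 2[a]3, 3[]3, 4[a]5, 5[]5]
        System.out.println(trace("a?", "ababa"));

        // Reluctant: tries the empty string first, so 'a' is never consumed.
        // Prints [0[]0, 1[]1, 2[]2, 3[]3, 4[]4, 5[]5]
        System.out.println(trace("a??", "ababa"));
    }
}
```

Note the last entry in both runs: an empty match is also reported at index 5, the position just past the final character.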


Working through a few examples will make the behaviour of matcher.find() clearer:

The regex engine starts at a position in the string (here, ababa) and checks whether the pattern you are looking for can be matched there. If the pattern matches, then (as mentioned in the API):

matcher.start() returns the starting index, and matcher.end() returns the offset after the last character matched.

If no character can be matched at that position, a pattern such as a?, which also accepts the empty string, matches the empty string instead. start() and end() then return the same index, which corresponds to a match of length zero.
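As a rough check of the point about start() and end(), the sketch below counts the matches where the two are equal and also verifies the documented equivalence between group() and substring(start(), end()). The class EmptyMatchCheck and the method countEmptyMatches are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchCheck {
    // Hypothetical helper: counts the zero-length matches of regex in input.
    static int countEmptyMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int empty = 0;
        while (m.find()) {
            // Documented equivalence: group() is the substring [start, end).
            if (!m.group().equals(input.substring(m.start(), m.end()))) {
                throw new IllegalStateException("group() differs from substring");
            }
            // A zero-length match is exactly one where start() == end().
            if (m.start() == m.end()) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        System.out.println(countEmptyMatches("a?", "ababa"));     // prints 3
        System.out.println(countEmptyMatches("a?", "abaabbbb"));  // prints 6
        System.out.println(countEmptyMatches("aa?", "abaabbbb")); // prints 0
    }
}
```

The pattern aa? never produces an empty match because it requires at least one 'a'.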

Look at the following examples:

    // Matches either "a" or the empty string
    Pattern pattern = Pattern.compile("a?");
    Matcher matcher = pattern.matcher("abaabbbb");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output:

    0[a]1
    1[]1
    2[a]3
    3[a]4
    4[]4
    5[]5
    6[]6
    7[]7
    8[]8

    // Matches either "aa" or "a"
    Pattern pattern = Pattern.compile("aa?");
    Matcher matcher = pattern.matcher("abaabbbb");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output:

    0[a]1
    2[aa]4