Java replaceAll with backreferences

Possible duplicate:
String.replaceAll() regular expression greedy quantifier anomaly

I wrote code that uses Matcher#replaceAll and found the following result very confusing:

 Pattern.compile("(.*)").matcher("sample").replaceAll("$1abc"); 

I would expect the output to be sampleabc , but Java returns sampleabcabc .

Does anyone have any ideas why?

Of course, when I anchor the pattern ( ^(.*)$ ), the problem disappears. However, I don't understand why replaceAll performs a double replacement.

And to add insult to injury, the following code:

 Pattern.compile("(.*)").matcher("sample").replaceFirst("$1abc") 

works as expected, returning only sampleabc .

2 answers

It seems that, for some reason, the pattern also matches an empty string at the end of the input. (I can see why it would match there; what intrigues me is that it matches once and only once.)

If you change replaceAll("$1abc") to replaceAll("'$1'abc") , the result will be 'sample'abc''abc .

Note that if you change (.*) to (.+) , it works as expected, because the pattern must then match at least one character.

The diagnosis is confirmed by this code:

 Matcher matcher = Pattern.compile("(.*)").matcher("sample");
 while (matcher.find()) {
     System.out.printf("%d to %d\r\n", matcher.start(), matcher.end());
 }

... which produces:

 0 to 6
 6 to 6

Two things combine to produce this behavior:

  • (.*) will successfully match the empty string.
  • After a successful match, the next match attempt starts at the position where the previous match ended.

So, after the entire string "sample" has been matched, another match is attempted immediately after the final e . Even though no characters remain at that point, the pattern matches the empty string there, and a second replacement occurs.

No further replacements occur because the regex engine always moves forward. The position immediately after the last character is a valid starting index, so the empty string matches there once; after that empty match, there are no more valid starting positions for the engine to try.
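
To make the two matches visible, here is a rough sketch of what replaceAll does, written as an explicit find() loop with appendReplacement and appendTail. The class name ReplaceAllSketch and the helper replaceAllByHand are made up for this illustration, and this is not the JDK's actual source, only a loop that should behave the same way for this input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ReplaceAllSketch {

    // Replaces every match of regex in input, one find() at a time.
    static String replaceAllByHand(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // For "(.*)" on "sample" this loop body runs twice:
            // once for the match 0 to 6 ("sample"), once for the empty match 6 to 6.
            m.appendReplacement(sb, replacement);
        }
        // Copies whatever input follows the last match (nothing here).
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: sampleabcabc
        System.out.println(replaceAllByHand("(.*)", "sample", "$1abc"));
    }
}
```

The first pass appends sampleabc , and the second pass, for the empty match at index 6, appends an empty group 1 followed by abc .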

As an alternative to anchoring your regular expression at the start of the string, you can make it match one or more characters by changing (.*) to (.+) .
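
For comparison, a small sketch that runs the original pattern and both fixes against the same input. The class name ReplaceAllFixes and the helper replaceAllWith are hypothetical and exist only for this example:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ReplaceAllFixes {

    // Applies replaceAll("$1abc") to the given input using the given pattern.
    static String replaceAllWith(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("$1abc");
    }

    public static void main(String[] args) {
        // Unanchored: also matches the empty string at index 6.
        System.out.println(replaceAllWith("(.*)", "sample"));   // sampleabcabc
        // Anchored: a second match cannot start at index 6 because of ^.
        System.out.println(replaceAllWith("^(.*)$", "sample")); // sampleabc
        // One or more characters: the empty string at index 6 cannot match.
        System.out.println(replaceAllWith("(.+)", "sample"));   // sampleabc
    }
}
```

The two fixes are not interchangeable for empty input: ^(.*)$ still matches an empty string once (producing abc ), while (.+) does not match it at all and leaves it unchanged.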
