Confusion over capture group behavior in Java regex

In this answer, I recommended using

s.replaceFirst("\\.0*$|(\\.\\d*?)0+$", "$1"); 

but two people complained that the result contained the string "null", for example 23.null. This could be explained by $1 (i.e. group(1)) being null, which String.valueOf converts to the string "null". However, I always get an empty string. My test case covers it, and

 assertEquals("23", removeTrailingZeros("23.00")); 

passes. Is the exact behavior undefined?

+7
java regex replace
3 answers

The documentation for the Matcher class in the reference implementation does not specify what the appendReplacement method does when the replacement string references a capture group that did not capture anything (null). Although the behavior of the group method is clear, nothing is said about this case for appendReplacement.
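To make the group side of this concrete, here is a minimal sketch using the pattern from the question. The class name UnmatchedGroupDemo and the helper firstGroup are made up for illustration; only Pattern, Matcher, find and group are real API. It shows that group(1) is null whenever the first alternative is the one that matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class UnmatchedGroupDemo {
    // Returns group(1) of the first match of the question's pattern,
    // or the marker "<no match>" when the pattern does not match at all.
    static String firstGroup(String s) {
        Matcher m = Pattern.compile("\\.0*$|(\\.\\d*?)0+$").matcher(s);
        return m.find() ? m.group(1) : "<no match>";
    }

    public static void main(String[] args) {
        // First alternative matches ".00", so group 1 never participates.
        System.out.println(firstGroup("23.00"));
        // Second alternative matches ".4500", so group 1 holds the kept part.
        System.out.println(firstGroup("23.4500"));
    }
}
```

What the replacement methods then do with that null is the part that differs between implementations.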

Below are three exhibits showing how implementations differ in this case:

  • The reference implementation appends nothing (or, equivalently, an empty string).
  • GNU Classpath and Android append null.

Some code has been omitted for brevity; omissions are indicated by ...

1) Sun / Oracle JDK, OpenJDK (reference implementation)

For the reference implementation (Sun / Oracle JDK and OpenJDK), the code for appendReplacement does not seem to have changed since Java 6, and it appends nothing when the capture group did not capture anything:

    } else if (nextChar == '$') {
        // Skip past $
        cursor++;
        // The first number is always a group
        int refNum = (int)replacement.charAt(cursor) - '0';
        if ((refNum < 0)||(refNum > 9))
            throw new IllegalArgumentException(
                "Illegal group reference");
        cursor++;

        // Capture the largest legal group string
        ...

        // Append group
        if (start(refNum) != -1 && end(refNum) != -1)
            result.append(text, start(refNum), end(refNum));
    } else {


2) GNU Classpath

GNU Classpath, which is a complete implementation of the Java class library, handles this case differently in appendReplacement. In Classpath, the classes in the java.util.regex package are just wrappers around the classes in gnu.java.util.regex.

Matcher.appendReplacement calls RE.getReplacement to process the replacement for the matched portion:

    public Matcher appendReplacement (StringBuffer sb, String replacement)
        throws IllegalStateException
    {
        assertMatchOp();
        sb.append(input.subSequence(appendPosition,
                                    match.getStartIndex()).toString());
        sb.append(RE.getReplacement(replacement, match,
                                    RE.REG_REPLACE_USE_BACKSLASHESCAPE));
        appendPosition = match.getEndIndex();
        return this;
    }

RE.getReplacement calls REMatch.substituteInto to get the contents of the capture group and appends the result directly:

    case '$':
        int i1 = i + 1;
        while (i1 < replace.length () &&
               Character.isDigit (replace.charAt (i1)))
            i1++;
        sb.append (m.substituteInto (replace.substring (i, i1)));
        i = i1 - 1;
        break;

REMatch.substituteInto appends the result of REMatch.toString(int) directly, without checking whether the capture group has captured anything:

    if ((input.charAt (pos) == '$')
        && (Character.isDigit (input.charAt (pos + 1))))
    {
        // Omitted code parses the group number into val
        ...

        if (val < start.length)
        {
            output.append (toString (val));
        }
    }

And REMatch.toString(int) returns null when the capture group did not capture anything (irrelevant code has been omitted):

    public String toString (int sub)
    {
        if ((sub >= start.length) || sub < 0)
            throw new IndexOutOfBoundsException ("No group " + sub);
        if (start[sub] == -1)
            return null;
        ...
    }

So, in GNU Classpath, null is appended to the result when the replacement string references a capture group that did not capture anything.

3) Android Open Source Project - Java Core Libraries

In Android, Matcher.appendReplacement calls the private appendEvaluated method, which in turn appends the result of group(int) directly to the output buffer.

    public Matcher appendReplacement(StringBuffer buffer, String replacement) {
        buffer.append(input.substring(appendPos, start()));
        appendEvaluated(buffer, replacement);
        appendPos = end();

        return this;
    }

    private void appendEvaluated(StringBuffer buffer, String s) {
        boolean escape = false;
        boolean dollar = false;

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && !escape) {
                escape = true;
            } else if (c == '$' && !escape) {
                dollar = true;
            } else if (c >= '0' && c <= '9' && dollar) {
                buffer.append(group(c - '0'));
                dollar = false;
            } else {
                buffer.append(c);
                dollar = false;
                escape = false;
            }
        }

        // This seemingly stupid piece of code reproduces a JDK bug.
        if (escape) {
            throw new ArrayIndexOutOfBoundsException(s.length());
        }
    }

Since Matcher.group(int) returns null for a capture group that did not capture anything, Matcher.appendReplacement appends null when such a group is referenced in the replacement string.

Most likely, the two people who complained ran their code on Android.
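If the code has to behave the same on all of these implementations, one option is to avoid "$1" in a replacement string altogether and assemble the result by hand. The following is only a sketch under that assumption: the class name PortableTrim is made up, and removeTrailingZeros is the method name borrowed from the question's test. It relies solely on find, group, start and end, whose behavior for an unmatched group is documented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class PortableTrim {
    private static final Pattern TRAILING =
            Pattern.compile("\\.0*$|(\\.\\d*?)0+$");

    static String removeTrailingZeros(String s) {
        Matcher m = TRAILING.matcher(s);
        if (!m.find()) {
            return s; // nothing to trim
        }
        // group(1) is null when the first alternative matched;
        // treat that explicitly as an empty string.
        String kept = m.group(1);
        if (kept == null) {
            kept = "";
        }
        return s.substring(0, m.start()) + kept + s.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingZeros("23.00"));
        System.out.println(removeTrailingZeros("23.4500"));
    }
}
```

The null check is the whole point: it makes explicit the decision that appendReplacement leaves to each implementation.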

+4

Having looked closely at the Javadoc, I came to the conclusion that

  • $1 is equivalent to calling group(1), which is specified to return null when the group did not capture anything.
  • The handling of nulls in the replacement expression is undefined.

The wording of the relevant parts of the Javadoc is generally surprisingly vague (emphasis mine):

"Dollar signs may be treated as references to captured subsequences as described above ..."

+4

You have two alternatives joined by |, but only the second is inside ( ), so if the first alternative matches, group 1 is null.

In general, place the parentheses around all alternatives.

In your case, you want to replace

  • "xxx.00000" with "xxx", or else
  • "xxx.yyy00" with "xxx.yyy"

It is better to do it in two steps, as that is more readable:

  • "xxx.y*00" to "xxx.y*", then
  • "xxx." to "xxx"

This does a bit more, also changing an input such as "1." to "1". So:

 s.replaceFirst("(\\.\\d*?)0+$", "$1").replaceFirst("\\.$", "");
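Wrapped in a method, the two-step version could look like the sketch below. The class name TwoStepTrim is made up for illustration, and removeTrailingZeros is the name used in the question's test. Group 1 takes part in every match of the first pattern, so "$1" never refers to a null group here.

```java
// Hypothetical helper class, not part of any library.
class TwoStepTrim {
    static String removeTrailingZeros(String s) {
        // Step 1: drop trailing zeros after the decimal point,
        // keeping the point and any preceding digits in group 1.
        // Step 2: drop a decimal point left dangling at the end.
        return s.replaceFirst("(\\.\\d*?)0+$", "$1")
                .replaceFirst("\\.$", "");
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingZeros("23.00"));
        System.out.println(removeTrailingZeros("23.4500"));
        System.out.println(removeTrailingZeros("1."));
    }
}
```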
+2