Java 8 regex: negative lookbehind with `\R`

While answering another question, I wrote a regular expression to match all whitespace up to and including at most one newline. I did this using a negative lookbehind for the \R linebreak matcher:

 ((?<!\R)\s)* 

Afterwards I thought about it and said: oh no, what if there is a \r\n? Surely it will grab the first linebreak character, \r, and then I will be stuck with a spurious \n at the front of my next string, right?

So I went back to test (and presumably fix) it. However, when I tested the pattern, it matched the entire \r\n. It does not match only the \r, leaving the \n, as one might expect.

 "\r\n".matches("((?<!\\R)\\s)*"); // true, expected false 

However, when I use the "equivalent" pattern mentioned in the documentation for \R , it returns false. So is it a bug with Java, or is there a good reason why it matches?

java regex java-8 regex-lookarounds
Feb 26 '17 at 21:42
2 answers

Realization #1. The documentation is incorrect

Source: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

It says here:

Linebreak matcher

\R ... is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

However, when we try the "equivalent" pattern, it returns false:

 String _R_ = "\\R";
 System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true
 // using "equivalent" pattern
 _R_ = "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";
 System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // false
 // now make it atomic, as per sln's answer
 _R_ = "(?>"+_R_+")";
 System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true

So the Javadoc should really say:

... is equivalent to (?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

Update, March 9, 2017, per Sherman at Oracle (JDK-8176029):

"api doc is NOT wrong, the implementation is wrong (which fails to backtrack to "0x0d + next.match()" when "0x0d + 0x0a + next.match()" fails)"




Realization #2. Lookbehinds don't only look backward

Despite the name, a lookbehind is not limited to looking backward: it can include, and even jump over, the current position.

Consider the following example (from rexegg.com ):

 "_12_".replaceAll("(?<=_(?=\\d{2}_))\\d+", "##"); // _##_ 

"This is interesting for several reasons. First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore."

What this means for our \R example is that even though our current position may be the \n, that does not stop the lookbehind from recognizing that the \r behind it is followed by that \n. It binds the two together as an atomic group and therefore refuses to recognize the \r on its own as a separate match behind the current position.

Note: for simplicity, I have used phrases such as "our current position is \n"; this is not an exact description of what happens internally.
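To make the position-by-position behaviour easier to see, here is a minimal sketch that asks, at each index of "\r\n", whether a negative lookbehind succeeds there. It deliberately avoids \R itself (whose behaviour depends on the JDK build, per the bug report above) and instead spells out the documented alternation, once plain and once wrapped in an atomic group. The class name LookbehindTrace and the helper noLinebreakBehind are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindTrace {
    // Plain alternation as spelled out in the Pattern Javadoc (not atomic).
    static final String PLAIN =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";
    // The same alternation wrapped in an atomic (independent) group.
    static final String ATOMIC = "(?>" + PLAIN + ")";

    /** True if the negative lookbehind for the given linebreak pattern succeeds exactly at pos. */
    static boolean noLinebreakBehind(String linebreak, String input, int pos) {
        Matcher m = Pattern.compile("(?<!" + linebreak + ")").matcher(input);
        // find(pos) searches from pos onward, so also check where the hit landed.
        return m.find(pos) && m.start() == pos;
    }

    public static void main(String[] args) {
        String input = "\r\n";
        for (int pos = 0; pos <= input.length(); pos++) {
            System.out.println("pos " + pos
                + "  plain: " + noLinebreakBehind(PLAIN, input, pos)
                + "  atomic: " + noLinebreakBehind(ATOMIC, input, pos));
        }
    }
}
```

The two variants should differ only at index 1, between the CR and the LF. The plain alternation can fall back to the lone CR, so the negative lookbehind fails there. The atomic group commits to CR LF, overshoots the current position, and cannot give the LF back, so the negative lookbehind succeeds and \s goes on to consume the LF.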

Feb 27 '17 at 21:30

The \R construct is a macro that wraps its subexpressions in an atomic group, (?>parts).

That is why it won't break them apart.
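As a rough illustration of what "atomic" means here, the sketch below uses a simplified two-branch stand-in for a linebreak (CR LF, or a lone CR) rather than the full \R definition, and requires one more LF after it. The class name AtomicGroupDemo and the helper linebreakThenLf are invented for the example.

```java
import java.util.regex.Pattern;

final class AtomicGroupDemo {
    // Simplified stand-in for a linebreak: CR LF first, then a lone CR.
    static final String PLAIN  = "(?:\\r\\n|\\r)";
    static final String ATOMIC = "(?>\\r\\n|\\r)";

    /** True if a linebreak branch followed by one more LF covers the whole input. */
    static boolean linebreakThenLf(String linebreak, String input) {
        return Pattern.matches(linebreak + "\\n", input);
    }

    public static void main(String[] args) {
        // Plain group: CR LF is tried first, the trailing LF then fails,
        // so the engine backtracks to the lone-CR branch and succeeds.
        System.out.println(linebreakThenLf(PLAIN, "\r\n"));   // true
        // Atomic group: once CR LF has matched, the group never gives the LF back.
        System.out.println(linebreakThenLf(ATOMIC, "\r\n"));  // false
    }
}
```

The same refusal to split CR LF is what keeps the lookbehind from seeing the CR on its own.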

Note: if Java accepts fixed-length alternations in a lookbehind, using \R there is fine; if the engine does not, it will throw an exception.

Feb 26 '17 at 21:52