Using Java Scanner with the \R pattern (buffer boundary issue)

Question

Summary: Are there any caveats or known issues with using \R (or other regex patterns) in Java's Scanner, especially regarding the boundaries of its internal buffer?

Details: Since I wanted to do some multi-line pattern matching on potentially multi-platform input files, I used patterns with \R, which according to the Pattern javadoc is:

Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
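To make that definition concrete, here is a minimal sketch that checks which strings count as exactly one \R. It assumes Java 8 or later (where \R is available); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class LinebreakPatternDemo {
    // Returns true if the whole input is matched by exactly one \R.
    static boolean isSingleLinebreak(String s) {
        return Pattern.matches("\\R", s);
    }

    public static void main(String[] args) {
        System.out.println(isSingleLinebreak("\r\n"));   // CRLF is treated as one linebreak
        System.out.println(isSingleLinebreak("\n"));     // LF
        System.out.println(isSingleLinebreak("\r"));     // CR on its own
        System.out.println(isSingleLinebreak("\u2028")); // Unicode line separator
        System.out.println(isSingleLinebreak("\n\r"));   // LF then CR: two linebreaks, not one
    }
}
```

The point relevant to this question is the first case: on a complete string, \R prefers the two-character \r\n sequence over a lone \r.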

Anyway, in one of my test files I noticed that the loop that was supposed to parse a block of a hex dump was cut short. After some debugging, I found that the line on which it stopped was located at the end of Scanner's internal buffer.

Here's the test program I wrote to simulate the situation:

 import java.io.IOException;
 import java.util.Scanner;

 // Class name added only so that the snippet compiles on its own.
 public class ScannerLinebreakTest {

     public static void main(String[] args) throws IOException {
         testString(1);
         testString(1022);
     }

     private static void testString(int prefixLen) {
         String suffix = "b\r\nX";
         // prefixLen copies of 'a', followed by the suffix
         String buffer = new String(new char[prefixLen]).replace("\0", "a") + suffix;
         Scanner scanner = new Scanner(buffer);
         String pattern = "b\\R";
         System.out.printf("=================\nTest String (Len=%d): '%s'\n'%s' found with horizon=0 (w/o bound): %s\n",
                 buffer.length(), convertLineEndings(buffer), pattern,
                 convertLineEndings(scanner.findWithinHorizon(pattern, 0)));
         // After matching b\R the scanner should be positioned directly before 'X'
         System.out.printf("'X' found with horizon=1: %b\n", scanner.findWithinHorizon("X", 1) != null);
         scanner.close();
     }

     // Makes CR and LF visible in the printed output
     private static String convertLineEndings(String string) {
         return string.replaceAll("\\n", "\\\\n").replaceAll("\\r", "\\\\r");
     }
 }

... which produces this output (edited for formatting / brevity):

 =================
 Test String (Len=5): 'ab\r\nX'
 'b\R' found with horizon=0 (w/o bound): b\r\n
 'X' found with horizon=1: true
 =================
 Test String (Len=1026): 'a ... ab\r\nX'
 'b\R' found with horizon=0 (w/o bound): b\r
 'X' found with horizon=1: false

To me, this looks like a bug! I think the scanner should match the suffix against the patterns in the same way regardless of where the suffix appears in the input text (as long as the prefix does not interact with the patterns). (I also found the possibly relevant OpenJDK bugs 8176407 and 8072582, but my test was run with the regular Oracle JDK 8u111.)

But I may have missed some recommendation regarding Scanner or this specific use of the \R pattern (or perhaps OpenJDK and Oracle have identical (??) implementations of the relevant classes here?) ... hence the question!
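As a cross-check, here is a minimal sketch that applies the same pattern with a plain Matcher over the complete string, so that Scanner and its internal buffer are not involved at all. If the truncated match comes from Scanner's buffering, this variant should return the full b\r\n for both input lengths. The class and helper names are invented for illustration; only the input construction and the pattern are taken from the test program above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FullStringMatchDemo {
    // Builds the same input as testString(prefixLen): prefixLen times 'a', then "b\r\nX".
    static String buildInput(int prefixLen) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < prefixLen; i++) {
            sb.append('a');
        }
        return sb.append("b\r\nX").toString();
    }

    // Returns the first match of b\R in the whole string, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = Pattern.compile("b\\R").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (int prefixLen : new int[] {1, 1022}) {
            String match = firstMatch(buildInput(prefixLen));
            // A length of 3 means 'b', CR and LF were all part of the match.
            System.out.printf("prefixLen=%d, match length=%d%n", prefixLen, match.length());
        }
    }
}
```

This only works when the whole input fits in memory as a String, so it is a diagnostic aid rather than a replacement for streaming with Scanner.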

java java.util.scanner regex

Ozgurh Mar 2 '18 at 14:07

1 answer

wp78de · Answer 1 · 2018-03-05T00:39:52+0000

Two suggestions:

I think you should check X like this:

 System.out.printf("'X' found with horizon=1: %b\n", scanner.findWithinHorizon("X", prefixLen) != null);

(Any horizon other than 0 limits the search to that many characters, hence the name of the method: the horizon is as far as the method looks ahead.)
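A small sketch of how the horizon behaves may help here. According to the Scanner javadoc, the horizon is counted in code points from the scanner's current position, so an earlier findWithinHorizon call that consumed input shifts where the next search starts. The class and method names below are made up for illustration.

```java
import java.util.Scanner;

public class HorizonDemo {
    // Searches for "X" within `horizon` code points of the start of the input.
    static String findFromStart(String input, int horizon) {
        try (Scanner scanner = new Scanner(input)) {
            return scanner.findWithinHorizon("X", horizon);
        }
    }

    // First consumes everything up to and including "b", then looks for "X"
    // within one code point of the new position.
    static String findAfterB(String input) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.findWithinHorizon("b", 0);
            return scanner.findWithinHorizon("X", 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(findFromStart("abX", 1)); // null: only "a" is inside the horizon
        System.out.println(findFromStart("abX", 3)); // X: the horizon now covers "abX"
        System.out.println(findFromStart("abX", 0)); // X: 0 means no bound
        System.out.println(findAfterB("abX"));       // X: the horizon starts after "b"
    }
}
```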

There may also be a problem with the encoding of the file: your scanner may pick the wrong encoding by default. Try something along these lines:

 new Scanner(file, "utf-8");
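To show what an explicit charset changes, here is a minimal sketch that decodes the same bytes twice through Scanner's (InputStream, String) constructor. The byte array stands in for file contents and the class and method names are invented for illustration; the same idea applies to the (File, String) constructor used above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerCharsetDemo {
    // Reads the first line of `data`, decoding the bytes with the named charset.
    static String firstLine(byte[] data, String charsetName) {
        try (Scanner scanner = new Scanner(new ByteArrayInputStream(data), charsetName)) {
            return scanner.nextLine();
        }
    }

    public static void main(String[] args) {
        // U+00E9 is encoded as two bytes in UTF-8.
        byte[] utf8 = "\u00e9\r\nX".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(utf8, "UTF-8").length());      // 1: decoded as one character
        System.out.println(firstLine(utf8, "ISO-8859-1").length()); // 2: each byte becomes its own character
    }
}
```

A wrong charset changes how many characters the scanner sees, which in turn shifts where a given piece of text falls relative to any character-based position.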
