BufferedReader: determine the byte offset of lines

I use a BufferedReader to read a byte stream (UTF-8 text) line by line. For a particular reason, I need to know exactly where in the byte stream each line begins.

Problem: I can't use the position of the InputStream that I wrap in the BufferedReader, because the reader buffers and reads several lines at a time.

My question is: how can I determine the exact byte offset of each line?

One obvious (but incorrect) solution would be to use (line + "\n").getBytes("UTF-8").length. There are two problems with this approach: 1) converting the string back to bytes just to count them is a lot of overhead, and 2) the line terminator is not always "\n"; it can also be "\r\n", etc.

Are there any other solutions for this?

EDIT: Every LineReader-like class I have seen so far seems to be buffered. Does anyone know of an unbuffered LineReader-like class?

+4
2 answers

Just read the file as raw bytes. A line break in UTF-8 will always be one of the byte sequences 13 10 (\r\n), 13 (\r), or 10 (\n). If the files use different EOL conventions, though, this is exactly the same problem you would have when reading the file as strings.

The raw-byte equivalent of BufferedReader is BufferedInputStream.
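A minimal sketch of the raw-byte approach, assuming the same line terminators that BufferedReader.readLine() recognizes (\n, \r, and \r\n). The names LineOffsets and lineOffsets are made up for illustration. Scanning individual bytes is safe here because the byte values 10 and 13 never occur inside a multi-byte UTF-8 sequence.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class LineOffsets {

    /** Returns the byte offset at which each line of the stream starts. */
    static List<Long> lineOffsets(InputStream raw) throws IOException {
        List<Long> offsets = new ArrayList<>();
        // Buffering is harmless here: we count every byte we consume ourselves.
        try (InputStream in = new BufferedInputStream(raw)) {
            long pos = 0;               // offset of the byte currently being examined
            boolean atLineStart = true; // true until the first byte of a line is seen
            boolean prevCR = false;     // true if the previous byte was '\r'
            int b;
            while ((b = in.read()) != -1) {
                if (prevCR && b == '\n') {
                    // Second half of "\r\n": belongs to the terminator, not to a new line.
                    prevCR = false;
                    pos++;
                    continue;
                }
                prevCR = false;
                if (atLineStart) {
                    offsets.add(pos);
                    atLineStart = false;
                }
                if (b == '\n' || b == '\r') {
                    atLineStart = true;
                    prevCR = (b == '\r');
                }
                pos++;
            }
        }
        return offsets;
    }
}
```

Calling LineOffsets.lineOffsets(new FileInputStream(path)) would give one offset per line; the line contents can then be decoded separately from the byte ranges between consecutive offsets.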

You can also count the UTF-8 bytes of a string without encoding it:

    public static int byteCountUTF8(String input) {
        int ret = 0;
        for (int i = 0; i < input.length(); ++i) {
            int cc = Character.codePointAt(input, i);
            if (cc <= 0x7F) {
                ret++;
            } else if (cc <= 0x7FF) {
                ret += 2;
            } else if (cc <= 0xFFFF) {
                ret += 3;
            } else if (cc <= 0x10FFFF) {
                ret += 4;
                i++; // supplementary code point: skip its low surrogate
            }
        }
        return ret;
    }
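One way to sanity-check a counter like this is to compare it with the length of the array returned by getBytes. The sketch below repeats the counting logic so that it is self-contained; the class name Utf8CountCheck and the helper matchesEncoder are hypothetical. It assumes well-formed strings: for unpaired surrogates the encoder substitutes a replacement, so the two counts may differ.

```java
import java.nio.charset.StandardCharsets;

final class Utf8CountCheck {

    // Same counting logic as byteCountUTF8 above, repeated for self-containment.
    static int byteCountUTF8(String input) {
        int ret = 0;
        for (int i = 0; i < input.length(); ++i) {
            int cc = Character.codePointAt(input, i);
            if (cc <= 0x7F) {
                ret++;
            } else if (cc <= 0x7FF) {
                ret += 2;
            } else if (cc <= 0xFFFF) {
                ret += 3;
            } else {
                ret += 4;
                i++; // skip the low surrogate of the pair
            }
        }
        return ret;
    }

    // True if the arithmetic count agrees with what the UTF-8 encoder produces.
    static boolean matchesEncoder(String s) {
        return byteCountUTF8(s) == s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Note that this only gives the size of the line's content; the terminator's size (one or two bytes) still has to be determined separately.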
+1

Try setting the buffer size:

    BufferedReader(Reader in, int sz)

Parameters:

in - a Reader

sz - input buffer size

Set the buffer size to 1.
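Whether this is enough depends on the Reader that is passed in: the InputStreamReader documentation notes that it may read ahead more bytes from the underlying stream than the current read needs. A minimal sketch for checking this on a given JDK, using a hypothetical CountingInputStream wrapper (not a JDK class) to record how many bytes were actually pulled from the stream:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class ReadAheadCheck {

    /** Hypothetical helper: counts the bytes read from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long consumed = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) consumed++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) consumed += n;
            return n;
        }
    }

    /** Reads only the first line and reports how many bytes left the stream. */
    static long bytesConsumedAfterFirstLine(byte[] data) throws IOException {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(data));
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(counting, StandardCharsets.UTF_8), 1);
        reader.readLine();
        return counting.consumed;
    }
}
```

If the returned count is larger than the byte length of the first line including its terminator, the decoder has read ahead, and the stream position still does not mark the line boundary even with a BufferedReader buffer of size 1.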

0