BufferedReader: detect byte offset of strings

Question


I use BufferedReader to read a byte stream (UTF-8 text) line by line. For a certain reason, I need to know exactly where in the byte stream each line begins.

Problem: I can't use the position of the InputStream that I connect to the BufferedReader, because the reader buffers and reads several lines at a time.

My question is: How to determine the exact byte offset of each line?

One obvious (but incorrect) solution would be to use (line + "\n").getBytes("UTF-8").length. There are two problems with this approach: 1) converting the string back to bytes just to count them is considerable overhead, and 2) the line terminator is not always "\n"; it can also be "\r\n", etc.

Are there any other solutions for this?

EDIT: every LineReader-like class that I have seen so far seems to be buffered. Does anyone know of an unbuffered LineReader class?

+4

java utf-8 bufferedreader

Johannes Jan 19 '13 at 13:58


2 answers

Try setting the buffer size with the two-argument constructor:

 BufferedReader(Reader in, int sz)

Parameters:
in - a Reader
sz - input buffer size

Set the buffer size to 1.

0

shuangwhywhy Jan 19 '13 at 14:22


Esailija · Accepted Answer · 2013-01-19T14:22:53+0000

Just read the file as raw bytes. A newline in UTF-8 will always be either 13 and 10, 13, or 10 (that is, "\r\n", "\r", or "\n"). This is exactly the same problem you would have if you read the file as a string, though, if the files have different EOL conventions.

The raw-byte equivalent of BufferedReader is BufferedInputStream.
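A minimal sketch of that idea, assuming the whole stream is scanned once and only the starting offsets are needed. The class name LineOffsets and the method lineStartOffsets are made up for illustration and are not part of the original answer. It relies on the fact that the byte values 10 and 13 never occur inside a multi-byte UTF-8 sequence, so the scan can work on raw bytes without decoding.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original answer.
class LineOffsets {

    // Returns the byte offset at which each line starts.
    // Treats "\n", "\r" and "\r\n" as line terminators, like BufferedReader.readLine().
    static List<Long> lineStartOffsets(InputStream raw) throws IOException {
        List<Long> offsets = new ArrayList<>();
        InputStream in = new BufferedInputStream(raw);
        long pos = 0;               // number of bytes consumed so far
        boolean atLineStart = true; // the next byte begins a new line
        boolean lastWasCr = false;  // the previous byte was '\r'
        int b;
        while ((b = in.read()) != -1) {
            if (lastWasCr && b == '\n') {
                // Second half of "\r\n": still part of the previous terminator.
                lastWasCr = false;
                pos++;
                continue;
            }
            if (atLineStart) {
                offsets.add(pos);
                atLineStart = false;
            }
            lastWasCr = (b == '\r');
            if (b == '\n' || b == '\r') {
                atLineStart = true;
            }
            pos++;
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "a\r\nb\u00e9\nc".getBytes(StandardCharsets.UTF_8);
        // Lines "a", "bé" and "c" start at byte offsets 0, 3 and 7.
        System.out.println(lineStartOffsets(new ByteArrayInputStream(data)));
    }
}
```

If the line text is needed as well, the bytes between two consecutive offsets can be decoded separately with new String(bytes, StandardCharsets.UTF_8).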

You can also count the UTF-8 bytes of a string without encoding it:

    public static int byteCountUTF8(String input) {
        int ret = 0;
        for (int i = 0; i < input.length(); ++i) {
            int cc = Character.codePointAt(input, i);
            if (cc <= 0x7F) {
                ret++;          // 1 byte: ASCII
            } else if (cc <= 0x7FF) {
                ret += 2;       // 2 bytes
            } else if (cc <= 0xFFFF) {
                ret += 3;       // 3 bytes: rest of the BMP
            } else if (cc <= 0x10FFFF) {
                ret += 4;       // 4 bytes: supplementary code point
                i++;            // skip the low surrogate of the pair
            }
        }
        return ret;
    }
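As a quick sanity check, the sketch below compares the counter with String.getBytes for a few well-formed sample strings. The class name Utf8CountCheck and the samples are illustrative assumptions; only byteCountUTF8 itself comes from the answer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper class for trying out the counter from the answer.
class Utf8CountCheck {

    // Counts the bytes the string would occupy in UTF-8, without encoding it.
    public static int byteCountUTF8(String input) {
        int ret = 0;
        for (int i = 0; i < input.length(); ++i) {
            int cc = Character.codePointAt(input, i);
            if (cc <= 0x7F) {
                ret++;
            } else if (cc <= 0x7FF) {
                ret += 2;
            } else if (cc <= 0xFFFF) {
                ret += 3;
            } else if (cc <= 0x10FFFF) {
                ret += 4;
                i++; // skip the low surrogate of the pair
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        // 1-, 2-, 3- and 4-byte code points, plus a mixed line.
        String[] samples = { "a", "\u00e9", "\u20ac", "\uD83D\uDE00", "na\u00efve \u20ac5" };
        for (String s : samples) {
            int counted = byteCountUTF8(s);
            int encoded = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(counted + " vs " + encoded);
        }
    }
}
```

Note that the count covers only the line's text. The line terminator that readLine() strips ("\n", "\r" or "\r\n") still has to be accounted for separately when accumulating offsets.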
