How to perform random reads of a UTF8 file

I understand that seeking to an arbitrary byte offset in a UTF-8 or UTF-16 encoded file may not land on a character boundary: the offset can fall in the middle of a multi-byte character or a surrogate pair (for example, in text in East Asian languages).

How can I use .NET to go to an approximate position in a file and read Unicode text from a semi-random position?

Should I discard the surrogate bytes and wait for a word break before I continue reading? If so, which word breaks are valid ones to wait for before I start decoding?

+5
3 answers

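Easy: UTF-8 is self-synchronizing.
Jump to a random byte in the file and skip every byte whose leading bits are 10 (continuation bytes). The first byte that does not start with 10 is the first byte of a well-formed UTF-8 character, and from there you can read the following bytes with a regular UTF-8 decoder.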
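
A minimal sketch of that resynchronisation, written in Python rather than .NET for brevity; the byte-level logic is the same in any language. UTF-8 continuation bytes have the bit pattern 10xxxxxx, so the code skips them after seeking. The function name `read_utf8_from` and the `max_bytes` parameter are hypothetical, and the file is assumed to be well-formed UTF-8.

```python
import codecs


def read_utf8_from(stream, offset, max_bytes=64):
    """Seek to a byte offset in a binary stream, skip to the next
    character boundary, and decode up to max_bytes of UTF-8 text."""
    stream.seek(offset)
    data = stream.read(max_bytes)

    # Skip continuation bytes (10xxxxxx). Well-formed UTF-8 never has
    # more than three of them in a row, so this loop is short.
    start = 0
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1

    # The end of the buffer may cut a character in half. With
    # final=False the incremental decoder holds back that incomplete
    # tail instead of raising an error.
    decoder = codecs.getincrementaldecoder("utf-8")()
    return decoder.decode(data[start:], final=False)
```

For example, with the bytes of "héllo 日本語 wörld" in an `io.BytesIO`, offset 8 lands inside 日, so the call returns "本語 wörld".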

+8

, UTF-8, , , " " ' (, , ), , . - :

  • - , , ; n , , , .
  • 1..<guessed number of characters in file>
  • ( , , ), :
  • , UTF-8, . ,

, "", , , :

A: 1000-1999 B: 2000-2999

1998-2001, .

A: 3000-3999

A B, .


@jleedev , , , " " . .

+2

For UTF-16, always seek to an even byte position. Then check whether the code unit at that position is a trailing (low) surrogate. If so, skip it; otherwise you are at the start of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).

The UTF-8 and UTF-16 encodings were specifically designed to be self-synchronizing, and there are strong guarantees that you never need to skip more than a small number of code units: at most three continuation bytes in UTF-8 and at most one trailing surrogate in UTF-16.
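
The UTF-16 case can be sketched the same way, again in Python rather than .NET. This assumes a well-formed little-endian file (UTF-16LE) in which code units start at even offsets; the function name `read_utf16le_from` and the `max_bytes` parameter are hypothetical. Trailing surrogates occupy the range U+DC00 to U+DFFF, so one range check on the first code unit is enough.

```python
import codecs
import struct


def read_utf16le_from(stream, offset, max_bytes=64):
    """Seek near a byte offset in a UTF-16LE binary stream, skip to the
    next character boundary, and decode up to max_bytes of text."""
    # Code units are two bytes wide, so round down to an even offset.
    offset -= offset % 2
    stream.seek(offset)
    data = stream.read(max_bytes)

    # If the first code unit is a trailing (low) surrogate, we landed
    # in the middle of a surrogate pair: skip that one code unit.
    start = 0
    if len(data) >= 2:
        (unit,) = struct.unpack_from("<H", data, 0)
        if 0xDC00 <= unit <= 0xDFFF:
            start = 2

    # final=False makes the decoder hold back an incomplete tail
    # (an odd byte or a lone leading surrogate) instead of raising.
    decoder = codecs.getincrementaldecoder("utf-16-le")()
    return decoder.decode(data[start:], final=False)
```

If the file starts with a byte order mark, a read from offset 0 returns it as the character U+FEFF at the start of the text; this sketch does not strip it.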

+1