How to perform random reads of a UTF8 file

Question

I understand that reading from a random position in a UTF-8 or UTF-16 encoded file may not land on a character boundary, because the position can fall in the middle of a multi-byte sequence or a surrogate pair (for example, in East Asian text).

How can I use .NET to go to an approximate position in a file and read Unicode text from a semi-random position?

Should I discard the surrogate bytes and wait for a word break before continuing to read? If so, what are the valid word breaks? Do I have to wait for one before I start decoding?

+5

c# unicode utf-8 utf-16 utf8-decode

CHI Coder 007 Feb 08 '11 at 15:35

3 answers

, UTF-8, , , " " ' (, , ), , . - :

- , , ; n , , , .
1..<guessed number of characters in file>
( , , ), :
, UTF-8, . ,

, "", , , :

A: 1000-1999 B: 2000-2999

1998-2001, .

A: 3000-3999

A B, .

@jleedev , , , " " . .

+2

AakashM Feb 08 '11 at 16:34

For UTF-16, you always need to seek to an even byte position, because every code unit is two bytes long. Then check whether the code unit at that position is a trailing (low) surrogate. If so, skip it; otherwise you are at the beginning of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).
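A minimal C# sketch of the UTF-16 case, written as top-level statements (C# 9 or later). The helper name `SyncUtf16LE` is hypothetical, and the code assumes a well-formed, little-endian file; a big-endian file would need the two bytes swapped.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: returns the offset of the first code unit at or after
// `start` (rounded down to an even offset) that begins a character.
// Assumes well-formed little-endian UTF-16.
static long SyncUtf16LE(Stream stream, long start)
{
    long pos = start & ~1L;            // round down to an even byte offset
    stream.Position = pos;
    int lo = stream.ReadByte();
    int hi = stream.ReadByte();
    if (lo == -1 || hi == -1)
        return pos;                    // at or past the end of the stream
    int unit = lo | (hi << 8);
    // 0xDC00..0xDFFF is a trailing (low) surrogate: step over it.
    return (unit >= 0xDC00 && unit <= 0xDFFF) ? pos + 2 : pos;
}

// Demo on an in-memory stream; a FileStream would be used the same way.
// Layout: 'a' at 0, surrogate pair D83D DE00 at 2..5, 'b' at 6.
byte[] data = Encoding.Unicode.GetBytes("a\U0001F600b");
long pos = SyncUtf16LE(new MemoryStream(data), 5);
string text = Encoding.Unicode.GetString(data, (int)pos, data.Length - (int)pos);
Console.WriteLine($"{pos}: {text}");
```

Starting from offset 5, the helper rounds down to 4, finds the trailing surrogate there, and resumes at offset 6.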

The Unicode encodings UTF-8 and UTF-16 were specifically designed to be self-synchronizing, and there are strong guarantees that you need to skip no more than a small number of code units (at most three bytes in UTF-8, at most one code unit in UTF-16).

+1

Philipp Feb 09 '11 at 14:32

Jaroslav Jandek · Accepted Answer · 2011-02-08T16:55:08+0000

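Easy, UTF-8 is self-synchronizing.
Jump to a random byte in the file and skip every byte whose leading bits are 10 (continuation bytes). The first byte that does not start with 10 is the lead byte of a valid UTF-8 character, and from there on you can decode the data as ordinary UTF-8.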
