Reading a Unicode UTF-8 File Using Non-Unicode Code

I need to read a text file that is UTF-8 encoded Unicode and write its contents to another text file. The file contains section-delimited data, organized in lines.

My reading code is C++ code without Unicode support. What I am doing is reading the file line by line into a string/char* buffer and writing that line to the destination file. I cannot change the code, so suggestions to change it are not welcome.

What I want to know is whether, when reading line by line, I can encounter the NUL terminating character ('\0') inside a line, since the data is Unicode and one character can span several bytes.

My suspicion is that a stray NUL byte may occur in the middle of a string. Your thoughts?
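
For illustration only (not a suggestion to change anything), here is a minimal sketch of the setup described above, assuming the lines are read with std::getline into plain char-based strings; the file names are placeholders:

    // Sketch of the scenario: read line by line with no Unicode handling
    // and write each line unchanged to the destination file.
    #include <fstream>
    #include <string>

    int main() {
        std::ifstream in("input.txt");   // placeholder file names
        std::ofstream out("output.txt");

        std::string line;                // plain char-based string
        while (std::getline(in, line)) {
            // The line is treated as an opaque run of bytes; the question is
            // whether a '\0' byte can appear inside it when the data is UTF-8.
            out << line << '\n';
        }
        return 0;
    }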

2 answers

UTF-8 uses 1 byte for all ASCII characters, which keep the same code values as in standard ASCII encoding, and up to 4 bytes for other characters. The upper bits of each byte are reserved as control bits; for code points encoded with more than one byte, these control bits are set, so every byte of a multi-byte sequence is non-zero.

Thus, your UTF-8 file should not contain any zero bytes (unless the text itself contains the U+0000 character).

See the Wikipedia article on UTF-8 for details.
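
As an illustrative check of this point (the sample characters are arbitrary, with their UTF-8 bytes written out explicitly so the example does not depend on the compiler's source or execution character set): every byte of a multi-byte sequence has its high bit set, so none of them can be 0x00.

    #include <cstdio>

    int main() {
        // 'é' -> 0xC3 0xA9 (2 bytes), '€' -> 0xE2 0x82 0xAC (3 bytes)
        const unsigned char utf8[] = { 0xC3, 0xA9, 0xE2, 0x82, 0xAC };

        for (unsigned char b : utf8) {
            // Every byte of a multi-byte UTF-8 sequence has the high bit set,
            // so it can never be 0x00.
            std::printf("0x%02X (high bit %s)\n",
                        static_cast<unsigned>(b), (b & 0x80) ? "set" : "clear");
        }
        return 0;
    }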


Very unlikely: all bytes in a UTF-8 multi-byte sequence have their high bit set to 1.
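
Expanding on this with an illustrative sketch (not part of the original answer): UTF-8 byte values can be classified by their leading bits, and only single-byte (ASCII) code points begin with a 0 bit, so only an actual U+0000 character produces a zero byte.

    #include <cstdio>

    // Classify one UTF-8 byte by its leading bit pattern.
    const char* classify(unsigned char b) {
        if ((b & 0x80) == 0x00) return "single-byte / ASCII (0xxxxxxx)";
        if ((b & 0xC0) == 0x80) return "continuation byte   (10xxxxxx)";
        if ((b & 0xE0) == 0xC0) return "lead of 2-byte seq  (110xxxxx)";
        if ((b & 0xF0) == 0xE0) return "lead of 3-byte seq  (1110xxxx)";
        if ((b & 0xF8) == 0xF0) return "lead of 4-byte seq  (11110xxx)";
        return "not valid in UTF-8";
    }

    int main() {
        const unsigned char samples[] = { 0x00, 0x41, 0xC3, 0xA9, 0xE2, 0xF0 };
        for (unsigned char b : samples)
            std::printf("0x%02X: %s\n", static_cast<unsigned>(b), classify(b));
        return 0;
    }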

