How to correctly determine the character encoding of text files?

Here is my situation: I need to correctly determine which character encoding is used for a given text file. The result should be one of the following values:

    enum CHARACTER_ENCODING
    {
        ANSI,
        Unicode,
        Unicode_big_endian,
        UTF8_with_BOM,
        UTF8_without_BOM
    };

So far, I can correctly detect Unicode, Unicode big endian, and UTF-8 with BOM text files by calling the following function. It also correctly reports ANSI when the given text file is not UTF-8 without BOM. The problem is that when the text file is UTF-8 without BOM, the function below mistakenly treats it as an ANSI file.

    CHARACTER_ENCODING get_text_file_encoding(const char *filename)
    {
        CHARACTER_ENCODING encoding;
        unsigned char uniTxt[] = {0xFF, 0xFE};        // Unicode (UTF-16 LE) BOM
        unsigned char endianTxt[] = {0xFE, 0xFF};     // Unicode big endian (UTF-16 BE) BOM
        unsigned char utf8Txt[] = {0xEF, 0xBB, 0xBF}; // UTF-8 BOM (three bytes)
        DWORD dwBytesRead = 0;
        HANDLE hFile = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
        {
            // Do not call CloseHandle here: the open failed, there is nothing to close.
            throw runtime_error("cannot open file");
        }
        BYTE lpHeader[3] = {0};
        ReadFile(hFile, lpHeader, 3, &dwBytesRead, NULL);
        CloseHandle(hFile);
        if (dwBytesRead >= 2 && lpHeader[0] == uniTxt[0] && lpHeader[1] == uniTxt[1])
            encoding = CHARACTER_ENCODING::Unicode;            // UTF-16 LE
        else if (dwBytesRead >= 2 && lpHeader[0] == endianTxt[0] && lpHeader[1] == endianTxt[1])
            encoding = CHARACTER_ENCODING::Unicode_big_endian; // UTF-16 BE
        else if (dwBytesRead >= 3 && lpHeader[0] == utf8Txt[0] && lpHeader[1] == utf8Txt[1]
                 && lpHeader[2] == utf8Txt[2])
            encoding = CHARACTER_ENCODING::UTF8_with_BOM;
        else
            encoding = CHARACTER_ENCODING::ANSI;               // fallback
        return encoding;
    }

This problem has blocked me for a long time, and I still cannot find a good solution. Any hint would be appreciated.

1 answer

For starters, there is no physical encoding called "Unicode"; what you call Unicode here is most likely UTF-16. Secondly, any file is "valid" in ANSI or in any other single-byte encoding. The best you can do is test the candidate encodings in a sensible order, ruling out the ones the file is invalid in.

You should check in the following order:

  • Is there a UTF-16 BOM at the beginning? Then it is probably UTF-16. Use the BOM to tell whether it is big endian or little endian, then check that the rest of the file is valid UTF-16.
  • Is there a UTF-8 BOM at the beginning? Then it's probably UTF-8. Check that the rest of the file is valid UTF-8.
  • If the above does not produce a positive match, check whether the whole file is valid UTF-8. If so, it's probably UTF-8.
  • If the above does not produce a positive match, assume ANSI.
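The third step above, validating the whole file as UTF-8, is the part the question's function is missing. A minimal sketch of such a validator (a hypothetical helper, not part of the original code) checks the multi-byte sequence structure and rejects overlong forms, UTF-16 surrogates, and out-of-range code points:

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if buf[0..len) is a well-formed UTF-8 byte sequence.
// Hypothetical helper sketching step 3 of the checklist above.
bool is_valid_utf8(const unsigned char *buf, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        std::size_t extra;        // number of continuation bytes
        std::uint32_t cp;         // decoded code point
        if (b < 0x80)                { i++; continue; }       // ASCII
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return false;        // stray continuation byte or invalid lead
        if (i + extra >= len) return false;                   // truncated
        for (std::size_t j = 1; j <= extra; j++) {
            if ((buf[i + j] & 0xC0) != 0x80) return false;    // bad continuation
            cp = (cp << 6) | (buf[i + j] & 0x3F);
        }
        static const std::uint32_t min_cp[4] = {0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[extra]) return false;                 // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       // surrogate
        if (cp > 0x10FFFF) return false;                      // out of range
        i += extra + 1;
    }
    return true;
}
```

Most text in single-byte encodings like Windows-1252 contains byte values in the 0x80-0xFF range that do not form valid UTF-8 sequences, which is why this check catches most ANSI files.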

If you expect UTF-16 files without a BOM (this is possible, for example, with XML files that specify the encoding in the XML declaration), then you will have to add that check as well. Any of the above checks can produce a false positive that wrongly identifies an ANSI file as UTF-* (though this is unlikely). Ideally you should have metadata that tells you what encoding the file is in; detecting it after the fact is impossible with 100% accuracy.
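For the BOM-less UTF-16 case, one common heuristic (a hypothetical helper, and only a guess, not a proof of the encoding) exploits the fact that mostly-ASCII text stored as UTF-16 has a zero byte in roughly every other position:

```cpp
#include <cstddef>

// Hypothetical heuristic for BOM-less UTF-16: returns true if more than
// half of the odd byte positions (UTF-16 LE) or even byte positions
// (UTF-16 BE) are 0x00, which is typical of mostly-ASCII UTF-16 text.
bool looks_like_bomless_utf16(const unsigned char *buf, std::size_t len)
{
    if (len < 4 || (len % 2) != 0) return false;  // UTF-16 has an even length
    std::size_t zeros_even = 0, zeros_odd = 0;
    for (std::size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00) {
            if (i % 2 == 0) zeros_even++;
            else            zeros_odd++;
        }
    }
    std::size_t half_pairs = len / 2;
    return zeros_even > half_pairs / 2 || zeros_odd > half_pairs / 2;
}
```

On Windows there is also IsTextUnicode, which applies a set of statistical tests like this one, but it is known to misidentify some short inputs, so treat its answer as a hint too.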

