Determining whether a file is probably UTF-8 should be fairly simple. Determining the encoding when it is not UTF-8 is very difficult in general.
If the file is encoded using UTF-8, the high-order bits of every byte must follow a pattern. If a character is one byte long, its most significant bit will be clear (zero). Otherwise, the first byte of an n-byte character (where n is 2 to 4) will have its high n bits set to one, followed by a zero bit. Each of the following n - 1 bytes must have its highest bit set and its second-highest bit clear.
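The bit patterns described above can be checked directly. The following is a minimal sketch in Python, assuming the whole file has already been read into a bytes object; the function name follows_utf8_bit_pattern is made up for illustration. It tests only the lead-byte and continuation-byte patterns, so it does not reject overlong forms or surrogate code points the way a strict decoder would.

```python
def follows_utf8_bit_pattern(data: bytes) -> bool:
    """Return True if every byte fits the UTF-8 lead/continuation bit pattern.

    Hypothetical helper: checks bit patterns only, not overlong encodings
    or out-of-range code points.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if (lead & 0b1000_0000) == 0:
            n = 1                                   # 0xxxxxxx: one-byte character
        elif (lead & 0b1110_0000) == 0b1100_0000:
            n = 2                                   # 110xxxxx: two-byte character
        elif (lead & 0b1111_0000) == 0b1110_0000:
            n = 3                                   # 1110xxxx: three-byte character
        elif (lead & 0b1111_1000) == 0b1111_0000:
            n = 4                                   # 11110xxx: four-byte character
        else:
            return False                            # stray continuation or invalid lead byte

        continuation = data[i + 1 : i + n]
        if len(continuation) != n - 1:
            return False                            # sequence is cut off at end of data
        # Every continuation byte must look like 10xxxxxx.
        if any((b & 0b1100_0000) != 0b1000_0000 for b in continuation):
            return False
        i += n
    return True
```

For example, follows_utf8_bit_pattern("é".encode("utf-8")) is true, while the single Latin-1 byte b"\xe9" fails because it announces a three-byte character that never arrives.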
If all the bytes in your file conform to these rules, it is probably encoded using UTF-8. I say "probably" because anyone can invent a new encoding that happens to follow the same rules, deliberately or by accident, but interprets the codes differently.
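To see why passing the test is only evidence and not proof, here is a small illustration in Python. It uses Latin-1 rather than an invented encoding, which is my own choice of example: the same bytes are valid under both interpretations but produce different text.

```python
# The same five bytes, read under two different encodings.
raw = b"caf\xc3\xa9"

as_utf8 = raw.decode("utf-8")       # 0xC3 0xA9 is one two-byte character
as_latin1 = raw.decode("latin-1")   # every byte is its own character

print(len(as_utf8), len(as_latin1))
```

Both decodes succeed, so the bytes alone cannot tell you which reading the author intended.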
Note that a file encoded using US-ASCII will also follow these rules, since the high bit of every byte is zero. It is fine to treat such a file as UTF-8, because the two encodings are identical in that range. If the file does not follow these rules, it is in some other encoding, and there is no built-in test to tell which one. You will need to use some contextual knowledge to guess.
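Put together, the decision might look like the sketch below. It is Python again, and classify_encoding is a hypothetical name; it leans on the standard strict UTF-8 decoder instead of hand-written bit checks, and bytes.isascii() requires Python 3.7 or later.

```python
def classify_encoding(data: bytes) -> str:
    """Hypothetical helper: sort bytes into 'ascii', 'utf-8', or 'unknown'."""
    if data.isascii():
        # Every byte has its high bit clear, so ASCII and UTF-8 agree.
        return "ascii"
    try:
        data.decode("utf-8")        # strict decoding raises on any invalid sequence
    except UnicodeDecodeError:
        # Some other encoding; guessing which one needs outside knowledge.
        return "unknown"
    return "utf-8"
```

An "unknown" result is where the contextual guessing starts: the file's origin, its expected language, or a metadata header are the kinds of hints that can narrow it down.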
erickson