For example, I allow the user to use Unicode UTF-8 and ISO-8859-2 for their CSV files. Is it possible to detect whether a given file is the former or the latter?
This is not possible with 100% accuracy because, for example, the bytes C3 B1 are just as valid a representation of "Ăą" in ISO-8859-2 as they are of "ñ" in UTF-8. In fact, because ISO-8859-2 assigns a character to every one of the 256 possible byte values, every UTF-8 string is also a valid ISO-8859-2 string (representing different characters if it is not pure ASCII).
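To illustrate the ambiguity, here is how that same two-byte sequence decodes under each encoding in Python:

```python
>>> b"\xc3\xb1".decode("utf-8")       # valid UTF-8: one character
'ñ'
>>> b"\xc3\xb1".decode("iso-8859-2")  # also valid ISO-8859-2: two characters
'Ăą'
```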
However, the converse is not true. UTF-8 has strict rules about which byte sequences are valid. More than 99% of all possible 8-octet sequences are not valid UTF-8, and your CSV files are probably much longer than that. Because of this, you can get good accuracy if you do the following (a sketch in Python follows the list):
- Run a strict UTF-8 validation on the file. If it passes, assume the data is UTF-8.
- Otherwise, assume it is ISO-8859-2.
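Something like this, assuming the file is small enough to read into memory; the file name `data.csv` is only a placeholder:

```python
def detect_csv_encoding(raw: bytes) -> str:
    """Guess between the two allowed encodings by strict UTF-8 validation."""
    try:
        raw.decode("utf-8")       # raises UnicodeDecodeError on any invalid sequence
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-2"       # every byte string is valid in ISO-8859-2

with open("data.csv", "rb") as f:  # placeholder file name
    data = f.read()

text = data.decode(detect_csv_encoding(data))
```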
More generally, is it possible to detect which of two allowed encodings a file uses?
UTF-32 (either byte order), UTF-8, and CESU-8 can be reliably detected by validation. UTF-16 can be detected by the presence of a BOM (but not by validation, since the only way for an even-length byte sequence to be invalid UTF-16 is to contain unpaired surrogates).
If at least one of your two encodings is “detectable,” you can check for the detectable one first and use the undetectable encoding as a fallback.
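As a sketch of that check-then-fall-back idea, assume the detectable encoding is UTF-16 (recognized by its BOM) and the undetectable fallback is ISO-8859-1; the function name is purely illustrative:

```python
import codecs

def detect_utf16_or_latin1(raw: bytes) -> str:
    """If a UTF-16 BOM is present, treat the data as UTF-16;
    otherwise fall back to the undetectable encoding."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"           # Python's utf-16 codec consumes the BOM
    return "iso-8859-1"
```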
If both encodings are “undetectable” (for example, ISO-8859-1 and ISO-8859-2), it gets more complicated; you can try a statistical approach such as chardet.
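For what it's worth, chardet exposes a simple detection call, though its guesses between closely related single-byte encodings are statistical and not guaranteed. The file name below is again a placeholder:

```python
import chardet                       # third-party library: pip install chardet

with open("data.csv", "rb") as f:    # placeholder file name
    raw = f.read()

guess = chardet.detect(raw)          # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.73, ...}
print(guess["encoding"], guess["confidence"])
```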
dan04