Is it possible to detect which of two possible encodings a text file uses?

I read "How can I determine the encoding / codepage of a text file", which says that detecting the encoding is not possible. However, is it possible to determine which of two allowed encodings a file uses?

For example, I allow the user to use Unicode UTF-8 and ISO-8859-2 for their CSV files. Is it possible to determine whether it is the former or the latter?

+4
5 answers

For example, I allow the user to use Unicode UTF-8 and ISO-8859-2 for their CSV files. Is it possible to detect whether it is the former or the latter?

This is not possible with 100% accuracy because, for example, the bytes C3 B1 are just as valid a representation of "Ăą" in ISO-8859-2 as they are of "ñ" in UTF-8. In fact, since ISO-8859-2 assigns a character to all 256 possible byte values, every UTF-8 string is also a valid ISO-8859-2 string (representing different characters wherever it is not ASCII).

However, the converse is not true. UTF-8 has strict rules about which byte sequences are valid. More than 99% of the possible 8-octet sequences are invalid UTF-8, and your CSV files are probably much longer than that. Because of this, you can get good accuracy if you:

  • Validate the data as UTF-8. If it passes, assume the data is UTF-8.
  • Otherwise, assume it is ISO-8859-2.
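
The two steps above could be sketched in C# roughly as follows. This is a minimal sketch and not the answerer's own code: the class and method names (CsvEncodingGuess, Guess) are made up for illustration, and it assumes .NET Core 3.0 or later, where the ISO-8859-2 code page has to be registered through CodePagesEncodingProvider before it can be looked up.

```csharp
using System;
using System.Text;

// Hypothetical helper; the names are illustrative only.
public static class CsvEncodingGuess
{
    // Strict UTF-8: emits no BOM when encoding and throws on invalid byte sequences.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static Encoding Guess(byte[] data)
    {
        try
        {
            // Decoding is used purely as validation; the decoded string is discarded.
            StrictUtf8.GetString(data);
            return StrictUtf8;
        }
        catch (DecoderFallbackException)
        {
            // Assumption: running on .NET Core / .NET 5+, where legacy code pages
            // are only available after registering this provider.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding("iso-8859-2");
        }
    }
}
```

A caller would read the file with File.ReadAllBytes, pass the bytes to Guess, and decode with the returned encoding. Pure ASCII input passes the UTF-8 check, which is harmless because both encodings agree on ASCII.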

However, is it possible to detect whether the encoding is one of the two allowed?

UTF-32 (either byte order), UTF-8, and CESU-8 can be reliably detected by validation. UTF-16 can be detected by the presence of a BOM (but not by validation, since the only way for an even-length byte sequence to be invalid UTF-16 is to contain unpaired surrogates).

If at least one of your encodings is "detectable", you can check for the detectable encoding and use the undetectable one as the fallback.

If both encodings are "undetectable", for example ISO-8859-1 and ISO-8859-2, it gets more complicated. You can try a statistical approach such as chardet.

+2

Since it is not possible to detect an encoding, you still cannot detect it even if you narrow the choice down to two possible encodings.

The only thing I can come up with is to try decoding the file with one of the two possible encodings, but then you need to check whether the result came out correctly. That involves analyzing the text, and even then you cannot be 100% sure that it is correct.

0

Both of these encodings assign the same meaning to all octets < 128.

So you need to look at the octets >= 128 to make a determination. Since in UTF-8 octets >= 128 always occur in groups (sequences of two or more octets encoding a single code point), the three-octet sequence {<128, >=128, <128} is an indicator of ISO-8859-2.
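
As an illustration of that indicator, here is a small C# sketch that scans a byte array for an octet >= 128 with no neighbouring octet >= 128. The class and method names are hypothetical. As a small extension of the rule above, the start and end of the data are treated like ASCII neighbours, on the assumption that the file is complete and not truncated mid-character.

```csharp
// Hypothetical helper; the names are illustrative only.
public static class IsolatedHighByte
{
    // Returns true if some octet >= 128 has no adjacent octet >= 128.
    // Such an octet cannot be part of a UTF-8 multi-byte sequence.
    public static bool Found(byte[] data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] < 0x80) continue; // ASCII octet, not informative

            bool prevHigh = i > 0 && data[i - 1] >= 0x80;
            bool nextHigh = i + 1 < data.Length && data[i + 1] >= 0x80;
            if (!prevHigh && !nextHigh) return true;
        }
        return false;
    }
}
```

A result of true rules out UTF-8. A result of false only means this particular indicator did not fire; it does not prove that the data is UTF-8.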

If the file contains no octets, or very few, outside ASCII (ASCII being octets < 128), then your ability to make the determination will be limited or nonexistent. Of course, if the file starts with the UTF-8 encoded BOM (likely if it comes from Windows), then you know that it is UTF-8.

It is generally more reliable to use some metadata (as XML does with its declaration) than to rely on a heuristic, because it is always possible that someone sent you ISO-8859-3.
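
The BOM check mentioned above is easy to do by hand: the UTF-8 encoded BOM is the three bytes EF BB BF. A minimal sketch, with hypothetical names:

```csharp
// Hypothetical helper; the names are illustrative only.
public static class Utf8Bom
{
    // True if the data starts with the UTF-8 encoded byte order mark (EF BB BF).
    public static bool HasBom(byte[] data) =>
        data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}
```

Those three bytes are also valid ISO-8859-2, so strictly speaking this is still a heuristic, but real CSV content is unlikely to begin with them by accident.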

0

If you use StreamReader, there is an overload that detects the encoding where possible (from a BOM), but it defaults to UTF8 if the detection does not work.

I would suggest offering the user two options (UTF8 or Current); if the user selects Current, use

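 // Needs System.Globalization, System.IO and System.Text; "path" is a placeholder for the CSV file path.
 var encoding = Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage);
 var reader = new StreamReader(path, encoding);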

which is likely to be the correct encoding.

0

See my (recent) answer to a related question: How to determine the encoding / codepage of a text file

This class checks whether it is possible that the file is UTF-8, and then tries to guess whether it is probable.

0
