Quote from the Python documentation:
UTF-8 has several convenient properties:
It can handle any Unicode code point.
A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes.
A string of ASCII text is also valid UTF-8 text.
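The "no embedded zero bytes" property is easiest to see next to an encoding that does produce them. Here is a small sketch in Python 3 (the variable names are just for illustration, and UTF-16 is my choice of comparison, not something the quote mentions):

```python
text = "test"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian UTF-16, no byte order mark

print(utf8_bytes)        # b'test'
print(utf16_bytes)       # b't\x00e\x00s\x00t\x00'

# Membership tests on bytes objects accept an integer byte value.
print(0 in utf8_bytes)   # False: no zero bytes in the UTF-8 form
print(0 in utf16_bytes)  # True: every ASCII character gets a zero high byte
```

A zero byte only shows up in UTF-8 output if the text itself contains the character U+0000.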
All ASCII texts are also valid UTF-8 texts. (UTF-8 is a superset of ASCII)
To make this clear, check out this console session:
>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>
However, not every UTF-8 encoded string is a valid ASCII string:
>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii')
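The last call fails with a UnicodeEncodeError, because those characters have no ASCII representation. So chardet is still right: only when the data contains at least one non-ASCII character can chardet tell that it is not ASCII-encoded.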
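If you want to run the same check yourself on Python 3, something like this should work. It is only a sketch: is_ascii_compatible is a helper name I made up, not part of chardet or the standard library.

```python
def is_ascii_compatible(text):
    """Return True if text encodes to identical bytes under ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # At least one character lies outside the 7-bit ASCII range.
        return False


plain = "test"
accented = "\u00e9\u00e2\u00f4"  # the characters é, â, ô

print(is_ascii_compatible(plain))     # True
print(is_ascii_compatible(accented))  # False
print(accented.encode("utf-8"))       # b'\xc3\xa9\xc3\xa2\xc3\xb4'
```

For pure ASCII input the two encodings produce byte-for-byte identical output, so a detector looking only at the bytes has no way to tell them apart and reporting "ascii" is a reasonable answer.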
Hope this simple explanation helps!
aIKid