Why does chardet say that my UTF-8 encoded string (originally decoded from ISO-8859-1) is ASCII?

I am trying to convert ISO-8859-1 text to UTF-8. In the small example below, the result is still detected as ASCII:

 chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
 print chardet.detect(chunk[0:2000])

It returns:

 {'confidence': 1.0, 'encoding': 'ascii'} 

How did that happen?

+7
python encoding utf-8 ascii decoding
3 answers

Quote from the Python documentation:

UTF-8 has several convenient properties:

  • It can handle any Unicode code point.

  • A Unicode string is turned into a byte string that contains no embedded null bytes. This avoids byte-ordering issues and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that cannot handle null bytes.

  • A string of ASCII text is also valid UTF-8 text.

All ASCII text is also valid UTF-8 text (UTF-8 is a superset of ASCII).

To make this clear, check out this console session:

 >>> s = 'test'
 >>> s.encode('ascii') == s.encode('utf-8')
 True
 >>>

However, not every UTF-8 encoded string is a valid ASCII string:

 >>> foreign_string = u"éâô"
 >>> foreign_string.encode('utf-8')
 '\xc3\xa9\xc3\xa2\xc3\xb4'
 >>> foreign_string.encode('ascii')  # This won't work, since it is invalid in the ASCII encoding
 Traceback (most recent call last):
   File "<pyshell#9>", line 1, in <module>
     foreign_string.encode('ascii')
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
 >>>

So chardet is right. Only if the string contains at least one non-ASCII byte can chardet report that it is not ASCII encoded.
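To tie this back to the question: if the original bytes contain only values below 128, decoding them as ISO-8859-1 and re-encoding as UTF-8 produces byte-for-byte identical output, so a detector has nothing to distinguish. A minimal sketch (Python 3 syntax here, unlike the Python 2 sessions above; stdlib only, no chardet):

```python
# Pure-ASCII input: the ISO-8859-1 -> UTF-8 round trip is a no-op.
chunk = b"hello world"
converted = chunk.decode("iso-8859-1").encode("utf-8")
print(converted == chunk)  # True -- the bytes are unchanged

# A byte >= 0x80 is where the two encodings diverge.
chunk = b"caf\xe9"  # 'cafe' with e-acute, as ISO-8859-1
converted = chunk.decode("iso-8859-1").encode("utf-8")
print(converted)  # b'caf\xc3\xa9' -- the accent is now two bytes
```

Only in the second case does the output contain bytes outside the ASCII range, giving a detector something to work with.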

Hope this simple explanation helps!

+7

UTF-8 is a superset of ASCII. This means that every valid ASCII file (one that uses only the first 128 characters, no extended characters) is also a valid UTF-8 file. Since the encoding is not stored explicitly but guessed each time, the detector defaults to the simpler character set. However, if you encode anything beyond the first 128 code points (e.g. foreign-language text) into UTF-8, it will very likely guess the encoding as UTF-8.
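The "default to the simpler character set" behavior described above can be illustrated with a toy detector (Python 3; this is an illustrative sketch, not chardet's actual code):

```python
def guess_encoding(data: bytes) -> str:
    """Toy detector: report 'ascii' when every byte is below 0x80,
    otherwise 'utf-8' if the bytes decode cleanly as UTF-8."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

print(guess_encoding(b"hello"))                  # ascii
print(guess_encoding("héllo".encode("utf-8")))   # utf-8
```

Because pure-ASCII bytes also decode cleanly as UTF-8, a detector that did not check the ASCII case first could never report "ascii" at all; reporting the simpler set is the only sensible tie-break.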

+3

This is why you got ascii:

https://github.com/erikrose/chardet/blob/master/chardet/universaldetector.py#L135

If all characters in the sequence are ASCII characters, chardet considers the string's encoding to be ASCII.

NB

The first 128 Unicode code points, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as in ASCII, which makes valid ASCII text also valid UTF-8-encoded Unicode.
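This single-octet property can be verified directly (Python 3 syntax, unlike the Python 2 snippets above):

```python
# Each of the first 128 code points encodes to one octet in UTF-8,
# with exactly the same binary value as in ASCII.
print(all(chr(i).encode("utf-8") == chr(i).encode("ascii") == bytes([i])
          for i in range(128)))  # True

print("A".encode("utf-8"))  # b'A' -- a single byte, 0x41, same as ASCII
```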

+1
