This is a common problem, so here is a fairly thorough illustration.
For non-unicode strings (i.e., those without the u prefix, unlike u'\xc4pple'), you must decode from the encoding the bytes are actually in (iso8859-1 / latin1 in this example; note that Python 2's implicit default is ASCII unless modified with the enigmatic sys.setdefaultencoding function) to unicode, then encode to a character set that can represent the characters you want. In this case I would recommend UTF-8.
First, here is a handy utility function that helps illuminate how Python 2.7 str and unicode values behave:
>>> def tell_me_about(s): return (type(s), s)
A plain string
>>> v = "\xC4pple"
Decoding an iso8859-1 string: converting a plain string to unicode
>>> uv = v.decode("iso-8859-1")
>>> uv
u'\xc4pple'
A few more illustrations - with "Ä"
>>> u"Ä" == u"\xc4"
True
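To make that equality concrete, here is a small sketch that looks the character up by code point. It uses the standard unicodedata module, which is my addition and not part of the session above; the variable names are just for illustration.

```python
import unicodedata

a_umlaut = u"\xc4"  # the same character as the literal A-with-diaeresis

# A unicode value holds code points, not bytes: this one is U+00C4
# no matter which encoding is used later.
code_point = ord(a_umlaut)
name = unicodedata.name(a_umlaut)

print(hex(code_point))  # 0xc4
print(name)             # LATIN CAPITAL LETTER A WITH DIAERESIS
```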
Encoding to UTF-8
>>> u8 = v.decode("iso-8859-1").encode("utf-8")
>>> u8
'\xc3\x84pple'
The relationship between unicode and UTF and latin1
>>> print u8
Äpple
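As a rough side-by-side, the sketch below encodes the same unicode text three ways and compares the resulting bytes. The b prefix is only there so the snippet behaves the same on 2.7 and 3.x; the variable names are mine.

```python
text = u"\xc4pple"  # one unicode value, five characters

latin1 = text.encode("iso-8859-1")
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")

# Same text, different bytes: one byte for the first character in latin1,
# two in UTF-8, and two per character plus a 2-byte BOM in UTF-16.
print(len(latin1))  # 5
print(len(utf8))    # 6
print(len(utf16))   # 12

# The "utf-16" codec writes a byte-order mark first, so the byte string
# starts with 0xff 0xfe (little-endian) or 0xfe 0xff (big-endian).
starts_with_bom = utf16[:2] in (b"\xff\xfe", b"\xfe\xff")
```

Decoding any of the three with its own codec gives back the same unicode value.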
Unicode Exceptions
>>> u8.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> u16.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>> v.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
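These occur because calling encode on a str makes Python 2 first decode it implicitly with the ASCII codec, which fails on any byte above 0x7f. (u16 is presumably the same text encoded as UTF-16; its leading 0xff is the byte-order mark.) You can get around them by explicitly decoding from the specific encoding (latin-1, utf8, utf16) to unicode first, e.g. u8.decode('utf8').encode('latin1').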
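If you do this often, the decode-then-encode step can be wrapped in a small helper. This is only a sketch: transcode is a hypothetical name of mine, not a standard function, and it assumes you know the source encoding.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by going bytes -> unicode -> bytes."""
    # Decode explicitly so Python never falls back to the ASCII codec.
    return data.decode(source_encoding).encode(target_encoding)


u8 = b"\xc3\x84pple"                         # UTF-8 bytes
latin1 = transcode(u8, "utf-8", "iso-8859-1")
back = transcode(latin1, "iso-8859-1", "utf-8")
```

Here latin1 comes out as the five bytes '\xc4pple', and back is identical to u8.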
Therefore, perhaps the following principles and generalizations can be drawn:
- The str type is a sequence of bytes that may be in one of several encodings, such as Latin-1, UTF-8 and UTF-16.
- The unicode type is a sequence of code points that can be encoded to any number of encodings, most often UTF-8 and latin-1 (iso8859-1).
- The print command has its own encoding logic: it uses sys.stdout.encoding, which is taken from the terminal's locale (commonly UTF-8).
- Before converting to another encoding, you must decode a str to unicode.
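For the print point, one way to take control of the output encoding yourself is sketched below. encode_for_stdout and the UTF-8 fallback are my own assumptions; the fallback matters because sys.stdout.encoding can be None when output is piped.

```python
import sys


def encode_for_stdout(text, fallback="utf-8"):
    """Encode unicode text with the encoding stdout reports, if any."""
    # sys.stdout.encoding may be missing or None (e.g. when piped to a file).
    encoding = getattr(sys.stdout, "encoding", None) or fallback
    # "replace" swaps characters the target encoding cannot represent
    # for "?" instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")


encoded = encode_for_stdout(u"\xc4pple")
```

The result is always a byte string, so what reaches the terminal no longer depends on an implicit conversion.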
Of course, all this changes in Python 3.x.
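For comparison, a minimal Python 3 sketch of the same steps. It assumes Python 3, where the byte type is bytes and the text type is str, and where the two are never mixed implicitly.

```python
raw = b"\xc4pple"                # bytes, here latin1-encoded
text = raw.decode("iso-8859-1")  # str (unicode text) in Python 3
utf8 = text.encode("utf-8")      # back to bytes, now UTF-8

# bytes has no encode() and str has no decode() in Python 3, so the
# "encode a byte string" mistake is an AttributeError, not a codec error.
bytes_can_encode = hasattr(raw, "encode")
str_can_decode = hasattr(text, "decode")

# Mixing the two types raises TypeError instead of guessing ASCII.
try:
    raw + text
    mixing_failed = False
except TypeError:
    mixing_failed = True
```

Both hasattr checks come out False and mixing_failed is True, so the implicit ASCII step behind the exceptions above is gone.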
Hope that is illuminating.
Further reading
See also the very illustrative writings of Armin Ronacher on Unicode in Python.
Brian M. Hunt, Jun 30 '11 at 19:18