Unable to decode UTF-8 string in Python on OS X Terminal.app

I have set Terminal.app to accept UTF-8, and in bash I can type Unicode characters and copy and paste them. In the Python shell I cannot, and if I try to decode Unicode, I get errors:

 >>> wtf = u'\xe4\xf6\xfc'.decode()
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
 >>> wtf = u'\xe4\xf6\xfc'.decode('utf-8')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Does anyone know what I'm doing wrong?

4 answers

I think encoding and decoding are being mixed up all over the place. You start with a unicode object:

 u'\xe4\xf6\xfc' 

This is a unicode object; the three characters are the Unicode code points for "äöü". If you want to turn them into UTF-8, you have to encode them:

 >>> u'\xe4\xf6\xfc'.encode('utf-8')
 '\xc3\xa4\xc3\xb6\xc3\xbc'

The resulting six bytes are the UTF-8 representation of "äöü".

If you call decode(...), you are asking Python to interpret a byte string in some encoding and convert it to Unicode. Since your object is already Unicode, this does not work: Python 2 first has to turn it back into a byte string, which it does implicitly with the ASCII codec, and that implicit step is what raises the UnicodeEncodeError. Your first call tries to convert ASCII to Unicode, the second tries to convert UTF-8 to Unicode. Since u'\xe4\xf6\xfc' is neither valid ASCII nor valid UTF-8, both conversion attempts fail.
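As a minimal sketch of the direction of each operation (text is encoded to bytes, bytes are decoded to text), the round trip looks like this. The variable names are made up for illustration, and the snippet is written so that it should run unchanged on Python 2.6+ and Python 3.3+:

```python
# Hypothetical variable names; not taken from the question.
text = u'\xe4\xf6\xfc'                    # a unicode object: three code points

utf8_bytes = text.encode('utf-8')         # text  -> bytes
round_trip = utf8_bytes.decode('utf-8')   # bytes -> text

print(len(text))             # number of code points
print(len(utf8_bytes))       # number of bytes after encoding
print(round_trip == text)    # decoding undoes the encoding
```

Each of the three characters takes two bytes in UTF-8, which is why the encoded form is twice as long as the unicode object.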

Further confusion may arise because '\xe4\xf6\xfc' also happens to be the Latin-1 / ISO-8859-1 encoding of "äöü". If you write a normal Python string (without the leading "u" that marks it as unicode), you can convert it to a unicode object with decode('latin1'):

 >>> '\xe4\xf6\xfc'.decode('latin1')
 u'\xe4\xf6\xfc'
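To make the Latin-1 coincidence concrete, here is a small sketch showing that the same three bytes decode cleanly as Latin-1 but are rejected as UTF-8. The names are illustrative only, and the b'' / u'' prefixes are used so the snippet should behave the same on Python 2.6+ and Python 3.3+:

```python
# Hypothetical names, for illustration only.
latin1_bytes = b'\xe4\xf6\xfc'            # three bytes, not a unicode object

# In Latin-1 every byte maps directly to the code point with the same value.
text = latin1_bytes.decode('latin-1')

# The same bytes are not a valid UTF-8 sequence, so this decode is rejected.
try:
    latin1_bytes.decode('utf-8')
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(text == u'\xe4\xf6\xfc')
print(decoded_as_utf8)
```

Note that the failure here is a UnicodeDecodeError, unlike the UnicodeEncodeError in the question, because this time the input really is a byte string.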

I think you have encoding and decoding backwards. You encode Unicode into a byte stream and decode a byte stream into Unicode.

 Python 2.6.1 (r261:67515, Dec 6 2008, 16:42:21)
 [GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> wtf = u'\xe4\xf6\xfc'
 >>> wtf
 u'\xe4\xf6\xfc'
 >>> print wtf
 äöü
 >>> wtf.encode('UTF-8')
 '\xc3\xa4\xc3\xb6\xc3\xbc'
 >>> print '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf-8')
 äöü

 >>> wtf = '\xe4\xf6\xfc'
 >>> wtf
 '\xe4\xf6\xfc'
 >>> print wtf
 
 >>> print wtf.decode("latin-1")
 äöü
 >>> wtf_unicode = unicode(wtf.decode("latin-1"))
 >>> wtf_unicode
 u'\xe4\xf6\xfc'
 >>> print wtf_unicode
 äöü

The Unicode strings section of the Python tutorial explains this well:

To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are preferred.

 >>> u"äöü".encode('utf-8')
 '\xc3\xa4\xc3\xb6\xc3\xbc'
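The answers in this thread spell encoding names several ways ('UTF-8', 'utf-8', 'latin1', 'latin-1'). As a rough sketch, assuming the standard codecs module, you can check that these spellings resolve to the same codec. The variable names are illustrative:

```python
import codecs

# codecs.lookup() resolves aliases and case variants to one codec entry.
utf8_upper = codecs.lookup('UTF-8').name
utf8_lower = codecs.lookup('utf-8').name

latin1_short = codecs.lookup('latin1').name
latin1_dashed = codecs.lookup('latin-1').name

print(utf8_upper == utf8_lower)        # both spellings name the same codec
print(latin1_short == latin1_dashed)   # 'latin1' and 'latin-1' are aliases
```

So the different spellings used above are interchangeable; the lowercase form is simply the preferred convention.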
