I start by creating a string variable holding some non-ASCII data encoded in UTF-8:
>>> text = 'á'
>>> text
'\xc3\xa1'
>>> text.decode('utf-8')
u'\xe1'
Calling unicode() on it raises an error ...
>>> unicode(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
... but if I know the encoding, I can pass it as the second argument:
>>> unicode(text, 'utf-8')
u'\xe1'
>>> unicode(text, 'utf-8') == text.decode('utf-8')
True
Now suppose I have a class whose __str__() method returns this text:
>>> class ReturnsEncoded(object):
...     def __str__(self):
...         return text
...
>>> r = ReturnsEncoded()
>>> str(r)
'\xc3\xa1'
unicode(r) seems to call str() on it, since it raises the same error as unicode(text) above:
>>> unicode(r)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
So far, everything is as expected!
But, as no one would expect, unicode(r, 'utf-8') does not even try:
>>> unicode(r, 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, ReturnsEncoded found
Why? Why is the behavior inconsistent? Is this a bug, or is it intended? It is very inconvenient.
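For reference, the explicit two-step conversion I would have expected unicode(r, 'utf-8') to perform (get the byte string, then decode it) can be sketched roughly as below. This is only a minimal sketch: to_text and text_type are hypothetical names I made up for illustration, not part of the standard library, and the function is written so that it also runs on Python 3.

```python
# Hypothetical helper (not part of the standard library): performs the
# explicit two-step conversion "get the byte string, then decode it".
text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_text(obj, encoding='utf-8'):
    if isinstance(obj, text_type):
        return obj                      # already decoded text
    # Byte strings are used as-is; any other object goes through str().
    raw = obj if isinstance(obj, bytes) else str(obj)
    if isinstance(raw, bytes):          # Python 2: str(obj) yields bytes
        return raw.decode(encoding)
    return raw                          # Python 3: str(obj) is already text


print(repr(to_text(b'\xc3\xa1')))
```

Under Python 2, to_text(r) would go through str(r) and then decode the result as UTF-8, which is the behavior I expected from unicode(r, 'utf-8').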