Yes, str.decode() returns a unicode string whenever the codec can decode the bytes without error. But the result represents the same text only if the correct codec is used.
Your sample does not use the correct codec: you have GBK-encoded text being decoded as Latin-1:
>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'
The values are not equal because they are not the same text.
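The transcripts above use Python 2; the same mismatch can be checked in Python 3, where the unicode/bytes distinction is explicit. This is a minimal sketch of the claim that decoding with the wrong codec still "works" but yields different text:

```python
# Decoding with the wrong codec produces a string, but not an equal one.
original = "\u4e2d\u6587"                       # 中文

roundtrip_ok = original.encode("gbk").decode("gbk")
roundtrip_bad = original.encode("gbk").decode("latin1")

print(roundtrip_ok == original)    # same codec both ways: the text survives
print(roundtrip_bad == original)   # codec mismatch: a different string entirely
```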
Again, it is important that you use the correct codec; another codec will lead to completely different results:
>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ
Here the GBK-encoded bytes were decoded as Latin-1, not GBK or UTF-8. The decode succeeds, because Latin-1 accepts every byte value, but the resulting text is mojibake and cannot be read.
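Because Latin-1 maps every byte to exactly one character, the wrong decode loses no data here, so the mistake can still be undone by re-encoding with the same wrong codec and decoding correctly. A Python 3 sketch:

```python
# Latin-1 decodes any byte sequence, so the decode "succeeds" but the
# text is mojibake. Reversing the two steps recovers the original.
original = "\u4e2d\u6587"                         # 中文
mojibake = original.encode("gbk").decode("latin1")

print(mojibake)                                   # unreadable characters
recovered = mojibake.encode("latin1").decode("gbk")
print(recovered == original)                      # the text is restored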
Note that pasting the non-ASCII characters only works because the Python interpreter correctly detected my terminal codec. I can paste the text from my browser into my terminal, which passes it on to Python as UTF-8 encoded data. Because Python asked the terminal what codec it uses, it was able to decode that data again in the u'....' unicode literal. And when the decoded unicode result is printed, Python once more automatically encodes the data to the terminal's codec.
To find out which codec Python detected for your terminal, inspect sys.stdin.encoding:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
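The same attributes exist in Python 3; the values below depend entirely on your terminal and locale, so 'UTF-8' in the transcript above is just what that particular terminal reported:

```python
import sys

# What codec Python detected for interactive input; may differ per system.
print(sys.stdin.encoding if sys.stdin else None)

# print() implicitly encodes text to this codec when writing to the terminal.
print(sys.stdout.encoding)
```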
Similar decisions apply when working with other sources of text. Reading string literals from a source file, for example, requires that you either stick to ASCII (and use escape sequences for everything else) or give Python an explicit codec declaration at the top of the file.
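The second option is the PEP 263 coding declaration. As a sketch (Python 3, using a temporary file only for illustration), the literal below is stored on disk as GBK bytes, and the declaration tells the compiler which codec to use when reading them:

```python
import os
import runpy
import tempfile

# Build the source file in binary so the string literal really is GBK bytes.
src = b"# -*- coding: gbk -*-\ntext = '" + "\u4e2d\u6587".encode("gbk") + b"'\n"

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
    f.write(src)
    path = f.name

module_vars = runpy.run_path(path)   # compiles honouring the coding cookie
os.remove(path)

print(module_vars["text"] == "\u4e2d\u6587")  # literal decoded correctly
```

Without the `# -*- coding: gbk -*-` line, the interpreter would try its default source codec on those bytes and fail (or silently produce the wrong text).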
I urge you to read up on Unicode to get a better understanding of how it works and how Python handles it.