Is u'string' the same as 'string'.decode('XXX')?

Although the title is a question, the short answer seems to be no. I tried it in the shell. The question is why? PS: the string contains non-ASCII characters, such as Chinese, and XXX is the current encoding of the string.

    >>> u'中文' == '中文'.decode('gbk')
    False
    # the first value is u'\xd6\xd0\xce\xc4', the second is u'\u4e2d\u6587'

See the example above. I am using Simplified Chinese Windows, where the default encoding is GBK, and so is the Python shell's. Yet the two unicode objects I got are not equal.

UPDATES

    >>> a = '中文'.decode('gbk')
    >>> a
    u'\u4e2d\u6587'
    >>> print a
    中文
    >>> b = u'中文'
    >>> print b
    ÖÐÎÄ
2 answers

Yes, str.decode() returns a unicode string, provided the codec successfully decodes the bytes. But the two values represent the same text only if the correct codec is used.
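For example, here is a minimal Python 2 sketch (the byte string below is the GBK encoding of 中文): decoding with the codec that actually produced the bytes gives the expected text, while a different codec does not.

    # Python 2 sketch: equality only holds when the right codec is used.
    gbk_bytes = '\xd6\xd0\xce\xc4'                       # '中文' encoded as GBK
    print gbk_bytes.decode('gbk') == u'\u4e2d\u6587'     # True: same text
    print gbk_bytes.decode('latin1') == u'\u4e2d\u6587'  # False: wrong codec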

Your sample text was not decoded with the right codec; you have GBK-encoded text that was decoded as Latin-1:

    >>> print u'\u4e2d\u6587'
    中文
    >>> u'\u4e2d\u6587'.encode('gbk')
    '\xd6\xd0\xce\xc4'
    >>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
    u'\xd6\xd0\xce\xc4'

The values really are not equal, because they are not the same text.

Again, it matters that you use the correct codec; a different codec produces a completely different result:

    >>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
    ÖÐÎÄ

Here the sample text was encoded to GBK and then decoded as Latin-1 instead of GBK. The decoding may succeed, but the resulting text is unreadable.

Note that pasting the non-ASCII characters only worked because the Python interpreter correctly detected my terminal's codec. I can paste text from my browser into my terminal, which then passes it to Python as UTF-8 encoded data. Because Python asked the terminal which codec it uses, it was able to decode the u'....' unicode literal correctly. And when printing the decoded unicode result, Python once again automatically encodes the data to match my terminal's encoding.

To see which codec Python detected, inspect sys.stdin.encoding:

    >>> import sys
    >>> sys.stdin.encoding
    'UTF-8'
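A rough sketch of that round trip, assuming an interactive UTF-8 terminal like the one above (the byte string is simply the UTF-8 encoding of 中文):

    import sys

    # Bytes a UTF-8 terminal would send when 中文 is pasted.
    pasted = '\xe4\xb8\xad\xe6\x96\x87'
    text = pasted.decode(sys.stdin.encoding)   # -> u'\u4e2d\u6587'
    print text                                 # re-encoded to the terminal codec on output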

Similar decoding decisions have to be made when dealing with other sources of text. Reading string literals from a source file, for example, requires that you either stick to ASCII (using escape codes for everything else) or give Python an explicit codec declaration at the top of the file.
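A hypothetical two-line module illustrating this; the declaration must match the encoding the file is actually saved in (GBK here is only an example):

    # -*- coding: gbk -*-
    # Without this declaration, Python 2 rejects the non-ASCII bytes below,
    # and you would have to write the escape form u'\u4e2d\u6587' instead.
    title = u'中文'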

I urge you to do some further reading to get a better understanding of how Unicode works and how Python handles Unicode.


Assuming Python 2.7, based on the title.

The answer is no. No, because when you call 'string'.decode(XXX), the unicode value you get depends on the codec you pass as an argument.

When you write u'string', the codec is inferred from the shell's current encoding, or, in a file, it is ASCII by default unless you add a special # coding: utf-8 comment at the top of the script.

To be clear: if codec XXX were guaranteed to always be the same codec used to enter the string (whether in the shell or in a file), then both approaches would behave almost the same.
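As a sketch of that case, assuming a script saved as UTF-8 with a matching declaration, the two spellings then agree:

    # -*- coding: utf-8 -*-
    # The codec passed to decode() matches the file's declared encoding,
    # so both spellings produce the same unicode object.
    a = u'中文'
    b = '中文'.decode('utf-8')
    print a == b             # True
    print repr(a), repr(b)   # u'\u4e2d\u6587' u'\u4e2d\u6587'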

Hope this helps!
