Concatenating unicode with str: print '£' + '1' works, but print '£' + u'1' throws a UnicodeDecodeError

I noticed the following:

    >>> print '£' + '1'
    £1
    >>> print '£' + u'1'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    >>> print u'£' + u'1'
    £1
    >>> print u'£' + '1'
    £1

Why does '£' + '1' work, but '£' + u'1' does not?

I looked at the types:

    >>> type('£' + '1')
    <type 'str'>
    >>> type('£' + u'1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    >>> type(u'£' + u'1')
    <type 'unicode'>

This puzzles me as well. If '£' + '1' is a str and not a unicode object, why does it print correctly on my terminal? Shouldn't it print something like '\xc2\xa31'?

To add to the mix, I also noticed the following:

    >>> u'£' + '1'
    u'\xa31'
    >>> type('1')
    <type 'str'>
    >>> type(u'£')
    <type 'unicode'>
    >>> print u'£' + '1'
    £1

Why doesn't evaluating u'£' + '1' show the £ character correctly, while print u'£' + '1' does? Is it because repr is used in the former, whereas str is used in the latter?

Also, why does concatenating a unicode and a str work for u'£' + '1', but not for '£' + u'1'?

python string-concatenation unicode
1 answer

You are mixing object types.

'£' is a byte string containing encoded data. That these bytes happen to represent the pound sign in your terminal or console is neither here nor there; they could just as well be pixels in an image. Your terminal or console is configured to produce and accept UTF-8 data, so the actual contents of that byte string are the two bytes C2 and A3, expressed in hexadecimal.

u'1', on the other hand, is a Unicode string. It is clearly text data. If you want to concatenate other data with it, that data must also be Unicode. If you try to do this with a str, Python 2 automatically decodes the str bytes to Unicode, using the default ASCII codec.

However, the '£' byte string cannot be decoded as ASCII. It can be decoded as UTF-8, so decode the bytes explicitly, since here we know the correct codec:

 print '£'.decode('utf8') + u'1' 
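To make the implicit step visible, here is a minimal sketch that performs by hand the decoding Python 2 attempts before concatenating. The variable names are illustrative only, and the bytes are written with escapes so the snippet does not depend on the terminal or source encoding; it is written to run on Python 2.7 as well as Python 3.3+.

```python
# What '£' holds when typed into a UTF-8 terminal under Python 2.
pound_bytes = b'\xc2\xa3'
one_text = u'1'

# Python 2 implicitly decodes the str operand with the default (ASCII) codec.
try:
    pound_bytes.decode('ascii')
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False  # byte 0xc2 is outside the ASCII range

# Decoding explicitly with the codec we know to be correct succeeds.
combined = pound_bytes.decode('utf-8') + one_text

# The reverse case, u'£' + '1', works because the bytes of '1' are pure ASCII.
reverse = u'\xa3' + b'1'.decode('ascii')
```

This also explains the asymmetry in the question: it is always the str operand that gets decoded, and only '£' contains bytes that ASCII cannot handle.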

When you write bytes to a terminal or console, it is the terminal or console that interprets those bytes and makes sense of them. If you write a unicode object to the terminal, the sys.stdout object takes care of the encoding, converting the text into bytes that your terminal or console will understand.

The same goes for input; the sys.stdin stream produces bytes, which Python can transparently decode when you use the u'£' syntax to create a Unicode object. You type a character on the keyboard, the terminal or console translates it into UTF-8 bytes, and those bytes are passed to Python for interpretation.

That printing '\xc2\xa3' works is a happy coincidence. You can take a unicode object, encode it with a different codec and end up with garbage:

    >>> print u'£1'.encode('latin-1')
    ?1

My Mac terminal turned the data written for the £ sign into a ?, because the A3 byte (the Latin-1 code point for the pound sign) does not map to anything when interpreted as UTF-8.
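The mismatch can be reproduced without a terminal. The sketch below encodes the same text with both codecs and then decodes the Latin-1 bytes as UTF-8; the lenient `'replace'` error handler is used here only as a stand-in for what a terminal does with bytes it cannot interpret, which is an assumption about terminal behaviour rather than something Python controls.

```python
text = u'\xa31'                      # the text "£1", written with an escape

as_utf8 = text.encode('utf-8')       # two bytes for the pound sign, then '1'
as_latin1 = text.encode('latin-1')   # one byte for the pound sign, then '1'

# A strict UTF-8 decoder rejects a lone 0xA3 byte.
try:
    as_latin1.decode('utf-8')
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# A lenient decoder substitutes U+FFFD, the replacement character.
lenient = as_latin1.decode('utf-8', 'replace')
```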

Python determines the terminal or console codec from the locale.getpreferredencoding() function; you can see which codec your terminal or console uses through the sys.stdout.encoding and sys.stdin.encoding attributes:

    >>> import sys
    >>> sys.stdout.encoding
    'UTF-8'
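One caveat worth sketching: under Python 2, sys.stdout.encoding can be None when output is piped or redirected rather than sent to a terminal. The helper below, `stream_encoding`, is a hypothetical name introduced for illustration; falling back to the locale's preferred encoding is one reasonable choice, not the only one.

```python
import locale
import sys


def stream_encoding(stream):
    """Return the stream's declared encoding, or the locale's preferred one."""
    declared = getattr(stream, 'encoding', None)
    # Fall back when the stream declares no encoding (e.g. piped output).
    return declared or locale.getpreferredencoding()


stdout_codec = stream_encoding(sys.stdout)
```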

Last but not least, do not confuse printing with the representations the interactive interpreter shows you. The interpreter displays the result of an expression using the repr() function, a debugging tool that tries to produce Python literal syntax wherever possible, using only ASCII characters. For Unicode values, this means that any non-printable or non-ASCII character is shown using escape sequences. This makes the value suitable for copying and pasting without requiring a medium that handles more than ASCII.

The repr() result for a str uses \n for newlines, for example, and \xhh hexadecimal escapes for bytes outside the printable range that have no dedicated escape sequence. In addition, for unicode objects, code points outside the Latin-1 range are represented with \uhhhh and \Uhhhhhhhh escape sequences, depending on whether or not they are part of the Basic Multilingual Plane:

    >>> u'''\
    ... A multiline string to show newlines
    ... can contain £ latin characters
    ... or emoji 💩!
    ... '''
    u'A multiline string to show newlines\ncan contain \xa3 latin characters\nor emoji \U0001f4a9!\n'
    >>> print _
    A multiline string to show newlines
    can contain £ latin characters
    or emoji 💩!
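The three escape forms can also be seen side by side with the standard unicode_escape codec. Note the assumption: this codec follows the same escaping scheme described above, but it is not how repr() itself is implemented, and on Python 3 repr() leaves printable non-ASCII characters unescaped. The sample values are chosen purely for illustration.

```python
samples = [
    u'\n',          # control character with a dedicated escape
    u'\xa3',        # pound sign, inside the Latin-1 range
    u'\u20ac',      # euro sign, inside the Basic Multilingual Plane
    u'\U0001f4a9',  # emoji, outside the Basic Multilingual Plane
]

# Each result is an ASCII-only byte string holding the escaped form.
escaped = [sample.encode('unicode_escape') for sample in samples]
```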