Reversing a UTF-8 string in Python

I am currently learning Python and, as a Slovenian, I often use UTF-8 characters to test my programs. Normally everything works fine, but there is one catch that I can't get past. Even though I have the encoding declared at the top of the file, it fails when I try to reverse a string containing special characters:

#-*- coding: utf-8 -*- a = "čšž" print a #prints čšž b = a[::-1] print b #prints  šō  instead of žšč 

Is there any way to fix this?

1 answer

Python 2 strings are byte strings, and UTF-8 encoded text can use multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters doesn't mean that Python knows which bytes form one UTF-8 character.

Your byte string consists of 6 bytes, and every two bytes form one character:

 >>> a = "čšž"
 >>> a
 '\xc4\x8d\xc5\xa1\xc5\xbe'

However, how many bytes UTF-8 uses depends on where the character is defined in the Unicode standard; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte, and for many emoji 4 bytes are required!
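To make the byte counts concrete, here is a small sketch that encodes a few sample characters and counts the resulting bytes. The sample characters and the `samples` / `utf8_lengths` names are just picked for illustration; the literals use explicit `u''` prefixes so the snippet should behave the same on Python 2 and Python 3.

```python
# Sample characters chosen for illustration, one per UTF-8 length class.
samples = [
    (u'a', 'ASCII letter a'),
    (u'\u010d', 'c with caron'),
    (u'\u20ac', 'euro sign'),
    (u'\U0001f600', 'grinning face emoji'),
]

# Encode each character on its own and count the bytes UTF-8 produced.
utf8_lengths = [len(ch.encode('utf-8')) for ch, _ in samples]

for (ch, label), n in zip(samples, utf8_lengths):
    print('%s: %d byte(s)' % (label, n))
```

The lengths come out as 1, 2, 3 and 4 bytes respectively, which is why slicing a byte string cannot be trusted to cut on character boundaries.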

In UTF-8, order is everything; reversing the above bytestring reverses the bytes, which results in some gibberish as far as the UTF-8 standard is concerned, but the middle 4 bytes just happen to be valid UTF-8 sequences (for š and ō):

 >>> a[::-1]
 '\xbe\xc5\xa1\xc5\x8d\xc4'
 -----~~~~~~~~^^^^^^^^####
   |     š       ō     |
   |                   opening UTF-8 byte missing a second byte
   invalid UTF-8 byte
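If you want to check this yourself, the following sketch tries to decode the reversed bytes. It assumes the same six bytes as above, written out as an explicit bytes literal so it runs on both Python 2 and Python 3; the variable names are only illustrative.

```python
# The UTF-8 bytes of the original three-letter string, spelled out explicitly.
a = b'\xc4\x8d\xc5\xa1\xc5\xbe'
reversed_bytes = a[::-1]

# Strict decoding rejects the reversed data because the first byte is a stray
# continuation byte and the last byte opens a sequence that never finishes.
try:
    reversed_bytes.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors='replace' each undecodable byte becomes U+FFFD, while the two
# middle pairs still decode as ordinary characters.
lenient = reversed_bytes.decode('utf-8', 'replace')

print(strict_ok)
print(repr(lenient))
```

Strict decoding fails, and the lenient result is U+FFFD, š (U+0161), ō (U+014D), U+FFFD, which matches what the terminal showed.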

You will have to decode the byte string into a unicode object that consists of individual characters. Reversing this object gives the correct results:

 b = a.decode('utf8')[::-1]
 print b

You can always encode the object back to UTF-8 again:

 b = a.decode('utf8')[::-1].encode('utf8') 
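If you need this in more than one place, the round trip can be wrapped in a small function. This is only a sketch: `reverse_utf8` is a made-up helper name, and it assumes the input really is valid UTF-8 (otherwise `decode` raises `UnicodeDecodeError`).

```python
def reverse_utf8(data):
    """Reverse UTF-8 encoded bytes code point by code point.

    Hypothetical helper: decode to text, reverse the text, encode again.
    """
    return data.decode('utf-8')[::-1].encode('utf-8')


# The same six bytes as in the question, as an explicit bytes literal.
a = b'\xc4\x8d\xc5\xa1\xc5\xbe'
b = reverse_utf8(a)

# Each two-byte pair stays intact; only the order of the pairs changes.
print(repr(b))
```

Reversing twice gets you back the original bytes, which is a quick sanity check that no character was split.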

Note that even with Unicode you can still run into problems when reversing text that uses combining characters. Reversing such text places the combining characters in front of, rather than after, the character they combine with, so they end up combined with the wrong character instead:

 >>> print u'e\u0301a'
 éa
 >>> print u'e\u0301a'[::-1]
 áe

You can mostly avoid this by converting the Unicode data to its composed normalized form, NFC (which replaces combinations with single-code-point forms where those exist), but there are plenty of other exotic Unicode characters that don't play well with string reversal.
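A rough sketch of the normalization approach, using `unicodedata.normalize` from the standard library. The `reverse_normalized` name is made up for this example, and the second input is there to show a case that normalization does not fix, since there is no precomposed "q with acute" for NFC to produce.

```python
import unicodedata


def reverse_normalized(text):
    """Compose to NFC first, then reverse by code point (hypothetical helper)."""
    return unicodedata.normalize('NFC', text)[::-1]


# e + combining acute + a: NFC folds the first two code points into U+00E9,
# so the accent stays attached to the e after reversing.
result = reverse_normalized(u'e\u0301a')

# q + combining acute + a: nothing to compose, so the accent still ends up
# following the a once the string is reversed.
stubborn = reverse_normalized(u'q\u0301a')

print(repr(result))
print(repr(stubborn))
```

So normalization helps for the common accented letters, but it is not a general solution for reversing arbitrary Unicode text.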
