I still don't understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 takes a completely different approach to the same problem.
What I know:
str is the older beast that stores strings encoded in one of the far too many encodings that history has forced us to work with.
unicode is a more standardized way of representing strings, using a huge table of all possible characters, emoji, little pictures, and so on.
The decode function converts a str to unicode; encode does the opposite.
If, in the Python shell, I simply say:
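To make sure I have the direction of the two calls right, here is a minimal sketch of the round trip as I understand it. The sample text and variable names are just illustrative, and I write the non-ASCII character as an escape so the snippet does not depend on any console or source-file encoding.

```python
# Round trip between text (unicode) and bytes (plain str in Python 2).
text = u'caf\u00e9'                  # unicode object with 4 characters
encoded = text.encode('utf-8')       # unicode -> bytes
decoded = encoded.decode('utf-8')    # bytes -> unicode

print(repr(encoded))
print(len(text), len(encoded))       # character count vs. byte count
print(decoded == text)
```

The byte string is one element longer than the text because the accented character takes two bytes in UTF-8.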
>>> my_string = "some string"
then my_string is a str encoded in ASCII (and, since ASCII is a subset of UTF-8, it is also valid UTF-8).
Therefore, for example, I can convert it to a unicode object using either of these lines:
>>> my_string.decode('ascii')
u'some string'
>>> my_string.decode('utf-8')
u'some string'
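The subset claim can be checked directly. This is only a small sketch of my understanding: for pure-ASCII input, both codecs should give the same unicode object, and encoding it back with either codec should give the original bytes.

```python
my_string = b'some string'            # b'' is an ordinary str in Python 2

as_ascii = my_string.decode('ascii')
as_utf8 = my_string.decode('utf-8')

# Both decodes agree for pure-ASCII input.
same_text = (as_ascii == as_utf8 == u'some string')

# Encoding back with either codec reproduces the original bytes.
same_bytes = (as_utf8.encode('ascii') == as_utf8.encode('utf-8') == my_string)

print(same_text, same_bytes)
```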
What I don't know:
How does Python handle non-ASCII strings that are entered in the shell, and, knowing this, what is the correct way to store the word "kožušček"?
For example, I can say
>>> s1 = 'kožušček'
In this case, s1 becomes an instance of str, which I cannot convert to unicode:
>>> s1 = 'kožušček'
>>> s1
'ko\x9eu\x9a\xe8ek'
>>> print s1
kožušček
>>> s1.decode('ascii')
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    s1.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)
Now, of course, I cannot decode the string using ascii, but which encoding should I use? After all, my sys.getdefaultencoding() returns ascii! Which encoding did Python use to encode s1 when I entered s1 = 'kožušček'?
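One thing I could try is to take the exact bytes from the repr above and test a few candidate codecs against the text I expected. This is only a sketch: the shortlist of codec names is my own guess, not something Python reported, and a match shows that the bytes are consistent with that encoding, not that the shell necessarily used it.

```python
raw = b'ko\x9eu\x9a\xe8ek'                 # bytes shown by the shell for s1
expected = u'ko\u017eu\u0161\u010dek'      # the intended word, written with escapes

# Hypothetical shortlist of encodings to try.
candidates = ['ascii', 'utf-8', 'cp1250', 'latin-1']

results = {}
for name in candidates:
    try:
        results[name] = raw.decode(name)
    except UnicodeDecodeError:
        results[name] = None               # these bytes are invalid in this codec

# Keep only the codecs that reproduce the expected text exactly.
matches = [name for name in candidates if results[name] == expected]
print(matches)
```

Of the four candidates, only cp1250 reproduces the expected text; latin-1 decodes without error but yields different characters, and ascii and utf-8 reject the bytes outright.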
Another thought I had was to say
>>> s2 = u'kožušček'
But then, when I printed s2, I got
>>> print s2
kouèek
which means that Python lost entire letters. Can someone explain this to me?