Processing Python Shell Strings

Question

I still don't understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 takes a completely different approach to the same problem.

What I know:

str is the older beast; it stores strings encoded in one of the far too many encodings that history has forced us to work with.

unicode is the more standardized way of representing strings, using a huge table of all possible characters, emoji, little pictures of dog poop, and so on.

The decode method converts a str to unicode; encode does the opposite.

If, in the Python shell, I simply say:

 >>> my_string = "some string"

then my_string is a str variable encoded in ASCII (and, since ASCII is a subset of UTF-8, it is also encoded in UTF-8).

Therefore, for example, I can convert it to a unicode variable with either of these lines:

 >>> my_string.decode('ascii')
 u'some string'
 >>> my_string.decode('utf-8')
 u'some string'
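The round trip described above can be sketched as follows. This is only an illustration, written with explicit b"..." and u"..." literals so that it runs on both Python 2.7 and Python 3; the variable names other than my_string are made up for the example.

```python
# Byte string (str in Python 2, bytes in Python 3) holding ASCII-only data.
my_string = b"some string"

# decode: bytes -> unicode text. ASCII is a subset of UTF-8, so both codecs
# give the same result for ASCII-only input.
as_ascii = my_string.decode("ascii")
as_utf8 = my_string.decode("utf-8")

# encode: unicode text -> bytes, the opposite direction.
round_trip = as_utf8.encode("utf-8")
```

For ASCII-only data the choice between the two codecs makes no difference; it only starts to matter once a byte above 0x7F appears.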

What I do not know:

How does Python handle non-ASCII strings that are typed into the shell and, knowing this, what is the correct way to store the word "kožušček"?

For example, I can say

 >>> s1 = 'kožušček'

In this case, s1 becomes an instance of str, which I cannot convert to unicode:

 >>> s1='kožušček'
 >>> s1
 'ko\x9eu\x9a\xe8ek'
 >>> print s1
 kožušček
 >>> s1.decode('ascii')
 Traceback (most recent call last):
   File "<pyshell#23>", line 1, in <module>
     s1.decode('ascii')
 UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)

Now, of course, I cannot decode the string using ASCII, but what encoding should I use? After all, my sys.getdefaultencoding() returns ascii! Which encoding did Python use to encode s1 when I entered s1='kožušček'?

Another thought I had was to say

 >>> s2 = u'kožušček'

But then, when I printed s2, I got

 >>> print s2
 kouèek

which means that Python lost whole letters. Can someone explain this to me?

+4

string encoding unicode python-2.x utf-8

5xum Jul 30 '15 at 7:42

2 answers

Your system does not necessarily use the sys.getdefaultencoding() encoding; that is only the default used when converting without an explicit encoding, as in:

 >>> import sys
 >>> sys.getdefaultencoding()
 'ascii'
 >>> unicode(s1)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)

Python's idea of your system locale is in the locale module:

 >>> import locale
 >>> locale.getdefaultlocale()
 ('en_US', 'UTF-8')
 >>> locale.getpreferredencoding()
 'UTF-8'

And using this, we can decode the string:

 >>> u1=s1.decode(locale.getdefaultlocale()[1])
 >>> u1
 u'ko\u017eu\u0161\u010dek'
 >>> print u1
 kožušček

There is a chance that the locale has not been set up, as is the case for the 'C' locale. That may cause the reported encoding to be None, even though the default is 'ascii'. Normally, figuring this out is the job of setlocale, which getpreferredencoding will call automatically. I would suggest calling it once at program startup and saving the returned value for all further use. The encoding used for file names may be yet another case, reported by sys.getfilesystemencoding().
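The "call it once at startup and keep the result" advice could look roughly like this. It is a minimal sketch, not the answerer's code: detect_encoding, its fallback parameter, and the ENCODING constant are hypothetical names introduced for the example, and falling back to 'ascii' is an assumption that mirrors the default mentioned above.

```python
import codecs
import locale


def detect_encoding(fallback="ascii"):
    """Return the locale's preferred encoding, or `fallback` if none is usable."""
    try:
        name = locale.getpreferredencoding()
    except Exception:
        # A misconfigured locale can make the lookup itself fail.
        name = None
    if not name:
        return fallback
    try:
        # Make sure Python actually has a codec registered under that name.
        codecs.lookup(name)
    except LookupError:
        return fallback
    return name


# Computed once at import time; the rest of the program reuses this value.
ENCODING = detect_encoding()
```

Byte strings that came from the console can then be decoded with some_bytes.decode(ENCODING) wherever they enter the program.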

Python's internal default encoding is set up by the site module, which contains:

 def setencoding():
     """Set the string encoding used by the Unicode implementation. The
     default is 'ascii', but if you're willing to experiment, you can
     change this."""
     encoding = "ascii" # Default value set by _PyUnicode_Init()
     if 0:
         # Enable to support locale aware default string encodings.
         import locale
         loc = locale.getdefaultlocale()
         if loc[1]:
             encoding = loc[1]
     if 0:
         # Enable to switch off string to Unicode coercion and implicit
         # Unicode to string conversion.
         encoding = "undefined"
     if encoding != "ascii":
         # On Non-Unicode builds this will raise an AttributeError...
         sys.setdefaultencoding(encoding) # Needs Python Unicode build !

So, if you want this enabled by default in every run of Python, you can change that first if 0 to if 1.

+2

Yann Vernier Jul 30 '15 at 8:18

Martijn Pieters · Accepted Answer · 2015-07-30T07:47:18+0000

str objects contain bytes. What those bytes represent, Python does not dictate. If you produced ASCII-compatible bytes, you can decode them as ASCII. If they contain bytes representing UTF-8 data, they can be decoded as such. If they contain bytes representing an image, then you can decode that information and display the image somewhere. When you use repr() on a str object, Python leaves any bytes that are printable ASCII as they are and converts the rest to escape sequences; this lets you debug such data even in an ASCII-only environment.
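To make the repr() behaviour concrete, here is a simplified imitation of it. escape_bytes is a hypothetical helper written for this illustration; the real repr() also adds quotes and uses short escapes such as \n and \t, which this sketch does not attempt to reproduce.

```python
def escape_bytes(data):
    """Keep printable ASCII bytes as characters; show every other byte as \\xNN."""
    out = []
    # bytearray yields integers on both Python 2 and Python 3.
    for byte in bytearray(data):
        # 0x20..0x7E is printable ASCII; the backslash (0x5C) is escaped too,
        # so the output stays unambiguous.
        if 0x20 <= byte < 0x7F and byte != 0x5C:
            out.append(chr(byte))
        else:
            out.append("\\x%02x" % byte)
    return "".join(out)


# The bytes from the question: three of them fall outside ASCII.
shown = escape_bytes(b"ko\x9eu\x9a\xe8ek")
```

Applied to the bytes from the question, the helper leaves k, o, u and e alone and turns the three non-ASCII bytes into \x9e, \x9a and \xe8, which matches what the interpreter echoed for s1.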

Your terminal or console, in which you run the interactive interpreter, writes bytes to the stdin stream that Python reads from as you type. Those bytes are encoded according to the configuration of that terminal or console.

In your case, your console most likely encoded the input you typed using a Windows code page. You will need to figure out the exact code page and use that codec to decode the bytes. Code page 1252 seems to fit:

 >>> print 'ko\x9eu\x9a\xe8ek'.decode('cp1252')
 kožušèek
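Note that the cp1252 result above ends in "èek", while the word in the question ends in "ček". As a hedged comparison (which code page the asker's console really uses cannot be determined from the post), the same bytes can be decoded with the Central European code page cp1250 as well. The variable names are illustrative only.

```python
# The bytes echoed by the interpreter in the question.
raw = b"ko\x9eu\x9a\xe8ek"

as_cp1252 = raw.decode("cp1252")  # Western European Windows code page
as_cp1250 = raw.decode("cp1250")  # Central European Windows code page

# Positions at which the two interpretations disagree.
differs_at = [
    index
    for index, (west, central) in enumerate(zip(as_cp1252, as_cp1250))
    if west != central
]
```

Both code pages map 0x9e and 0x9a to ž and š; they differ only on 0xe8, which is è (U+00E8) in cp1252 and č (U+010D) in cp1250. Decoding as cp1250 therefore reproduces "kožušček" exactly.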

When you print those same bytes, your console reads them and interprets them with the same codec it is already configured with.

Python can tell you which codec it thinks your console uses; it tries to detect this information for Unicode literals, where the input has to be decoded for you. It uses the locale.getpreferredencoding() function to determine this, and sys.stdin and sys.stdout have an encoding attribute; mine are set to UTF-8:

 >>> import sys
 >>> sys.stdin.encoding
 'UTF-8'
 >>> import locale
 >>> locale.getpreferredencoding()
 'UTF-8'
 >>> 'kožušèek'
 'ko\xc5\xbeu\xc5\xa1\xc3\xa8ek'
 >>> u'kožušèek'
 u'ko\u017eu\u0161\xe8ek'
 >>> print u'kožušèek'
 kožušèek

Because my terminal is configured for UTF-8, and Python has detected this, using a Unicode literal u'...' works. The data is automatically decoded by Python.

Why exactly your console lost whole letters, I don't know; I would need access to your console to experiment further: look at the output of print repr(s2), and check all bytes between 0x00 and 0xFF to see whether the problem is on the input or the output side of the console.

I recommend you read up on Python and Unicode:
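The byte-by-byte check suggested above could be started along these lines. This is only a sketch of one possible diagnostic: survey is a hypothetical helper, and the choice of latin-1 and cp1250 as codecs to compare is an assumption, not something established by the post.

```python
import unicodedata


def survey(codec, start=0x80, stop=0x100):
    """Map each byte value in [start, stop) to (character, Unicode category).

    Bytes the codec leaves undefined are mapped to None.
    """
    table = {}
    for value in range(start, stop):
        # Build a one-byte string in a way that works on Python 2 and 3.
        raw = bytes(bytearray([value]))
        try:
            char = raw.decode(codec)
        except UnicodeDecodeError:
            table[value] = None
            continue
        table[value] = (char, unicodedata.category(char))
    return table


latin1 = survey("latin-1")
cp1250 = survey("cp1250")
```

Looking up 0x9e in both tables shows a letter (category Ll) under cp1250 but a control character (category Cc) under latin-1. Control characters are often rendered as nothing at all, so a mismatch of this kind would look like dropped letters; whether that is what happened in the asker's console remains unconfirmed.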

Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Python Unicode HOWTO
