Byte string literal with non-ASCII characters

Apparently, I can do this in Python 2.7:

    value = '國華'

Python seems to use some encoding to turn the characters in a string literal into a byte string. Which encoding is that? Is it the one reported by sys.getdefaultencoding(), the encoding of the source file, or something else?

thanks

+6
2 answers

sys.getdefaultencoding() has nothing to do with the encoding of the source file or of the terminal. It is the encoding used for implicit conversion between byte strings and Unicode strings, and it should always be "ascii" in Python 2.X ("utf8" in Python 3.X).
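A small sketch of that distinction, written to run on Python 3 and using byte escapes so that it does not depend on how the file itself is saved. The variable names are illustrative only. The default encoding is just a process-wide setting, and an explicit decode call names its own codec, so the default never comes into play.

```python
import sys

# Process-wide default used for implicit conversions:
# 'ascii' on Python 2.X, 'utf-8' on Python 3.X.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Explicit conversion names the codec, independent of the default above.
utf8_bytes = b'\xe5\x9c\x8b\xe8\x8f\xaf'  # UTF-8 bytes of the two characters
text = utf8_bytes.decode('utf-8')
print(repr(text))
```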

In Python 2.X, your line of code in a script without an encoding declaration produces an error:

 SyntaxError: Non-ASCII character '\x87' in file ... 

The actual non-ASCII byte reported may differ, but the code will not work without an encoding declaration. An encoding declaration is required to use non-ASCII characters in Python 2.X source, and it must match the encoding the source file is actually saved in. For instance:

    # coding: utf8
    value = '國華'

when saved as cp936, produces:

 SyntaxError: 'utf8' codec can't decode byte 0x87 in position 9: invalid start byte 
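The same failure can be reproduced outside the parser. This is a minimal sketch, assuming the cp936 bytes shown later in this answer ('\x87\xf8\xc8A') are what the editor wrote to disk; decoding them as UTF-8 fails on the first byte for the same reason the parser complains.

```python
cp936_bytes = b'\x87\xf8\xc8A'  # the two characters as saved by a cp936 editor

error = None
try:
    # The declaration says utf8, but the bytes on disk are cp936.
    cp936_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

Here the offending byte is at position 0 because only the literal's bytes are decoded; in a real file the position counts from the start of the line or file.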

When the declaration is correct, a byte string literal contains exactly the bytes that are in the source file, that is, the characters as encoded in the source encoding. When Python parses a Unicode string literal, those bytes are decoded into Unicode code points using the declared source encoding. Note the difference between printing a UTF-8 byte string and a Unicode string on a cp936 console:

    # coding: utf8
    value = '國華'
    print value,repr(value)
    value = u'國華'
    print value,repr(value)

Output:

    鍦嬭彲 '\xe5\x9c\x8b\xe8\x8f\xaf'
    國華 u'\u570b\u83ef'

The byte string contains the three-byte UTF-8 encodings of the two characters, but it is displayed incorrectly because the cp936 terminal misinterprets that byte sequence. The Unicode string prints correctly, and its repr shows the Unicode code points that were decoded from the UTF-8 bytes in the source file.
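The garbled display can be imitated without a cp936 console. The sketch below (Python 3, illustrative names) encodes the two characters as UTF-8 and then decodes the same bytes as cp936, which is roughly what the terminal does when it receives a byte string it did not expect. The errors='replace' argument is only a precaution in case a byte pair has no cp936 mapping.

```python
text = u'\u570b\u83ef'            # the two characters from the question
utf8_bytes = text.encode('utf-8')  # six bytes, three per character

# A cp936 terminal reads the same bytes as 2-byte sequences instead.
misread = utf8_bytes.decode('cp936', errors='replace')

print(repr(utf8_bytes))
print(misread == text)
```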

Now note the difference when the file is saved in, and declares, the encoding that matches the terminal:

    # coding: cp936
    value = '國華'
    print value,repr(value)
    value = u'國華'
    print value,repr(value)

Output:

    國華 '\x87\xf8\xc8A'
    國華 u'\u570b\u83ef'

The byte string now contains the two-byte cp936 encodings of the two characters ("A" is equivalent to "\x41") and is displayed correctly, since the terminal understands the cp936 byte sequence. The Unicode string contains the same Unicode code points as in the previous example, because the source bytes were decoded using the declared source encoding.
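To make the last point concrete, here is a hedged sketch (Python 3, illustrative names) that takes the two byte sequences shown in the outputs above and decodes each with its own codec. Different bytes on disk, same code points once decoded.

```python
utf8_source = b'\xe5\x9c\x8b\xe8\x8f\xaf'  # literal bytes in the utf8 file
cp936_source = b'\x87\xf8\xc8A'            # literal bytes in the cp936 file

# Decoding with the matching declared encoding yields the same text.
from_utf8 = utf8_source.decode('utf-8')
from_cp936 = cp936_source.decode('cp936')

print(from_utf8 == from_cp936)
print(len(utf8_source), len(cp936_source))
```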

If the script has the correct source encoding declaration and uses Unicode strings for text, it displays the correct characters (1) regardless of the terminal encoding (2). If the terminal encoding does not support a character, Python raises UnicodeEncodeError rather than displaying the wrong character.
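A rough sketch of that behavior, with explicit encode calls standing in for what print does with the terminal's encoding (normally sys.stdout.encoding). The three encodings tried here are illustrative choices, not something taken from the question; cp437 is picked as an example of a code page with no CJK characters.

```python
text = u'\u570b\u83ef'

results = {}
for terminal_encoding in ('utf-8', 'cp936', 'cp437'):
    try:
        # Printing a Unicode string encodes it for the terminal first.
        results[terminal_encoding] = text.encode(terminal_encoding)
    except UnicodeEncodeError:
        # Unsupported characters raise instead of printing garbage.
        results[terminal_encoding] = None

for name in sorted(results):
    print(name, repr(results[name]))
```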

Final note: Python 2.X assumes an "ascii" source encoding unless declared otherwise, and allows non-ASCII characters in byte string literals if the declared encoding supports them. Python 3.X assumes a "utf8" source encoding by default (so be sure to save the file in that encoding or declare otherwise) and does not allow non-ASCII characters in byte string literals.
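The Python 3 half of that note can be checked with the built-in compile(). This sketch assumes it is run under Python 3; the source text is built from escapes so the example file itself stays ASCII.

```python
# Source text containing a bytes literal with two non-ASCII characters.
source = u"value = b'\u570b\u83ef'"

try:
    compile(source, '<example>', 'exec')
    rejected = False
except SyntaxError as exc:
    # Python 3 refuses non-ASCII characters inside b'...' literals.
    rejected = True
    print(exc.msg)
```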

(1) If the terminal font supports the character.
(2) If the terminal encoding supports the character.

+7
    value = b'國華'

doesn't make sense (the b is implied in Python 2.x): why would you want a byte string containing characters? Python just stores the bytes in whatever encoding your terminal or editor uses. What you want is a character string:

    value = u'國華'

In a source code file (as opposed to the interactive shell), do not forget to declare the encoding by adding the following line at the top of the file:

 # -*- coding: utf-8 -*- 
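To see which encoding a given declaration line selects, Python 3's standard library exposes tokenize.detect_encoding(). The sketch below is Python 3 only; the helper declared_encoding and the two sample file contents are hypothetical names made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3's tokenizer detects for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Two hypothetical files, given as raw bytes the way they would sit on disk.
utf8_file = b"# -*- coding: utf-8 -*-\nvalue = 1\n"
cp936_file = b"# coding: cp936\nvalue = 1\n"

print(declared_encoding(utf8_file))
print(declared_encoding(cp936_file))
```

Both comment styles are recognized, as long as the declaration appears on the first or second line of the file.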
0

Source: https://habr.com/ru/post/923065/

