Python 2 assumes various source code encodings

Question


I noticed that, without an encoding declaration, the Python 2 interpreter assumes that the source code is encoded in ASCII when running scripts or reading from standard input:

    $ python test.py  # where test.py holds the line: print u'é'
      File "test.py", line 1
    SyntaxError: Non-ASCII character '\xc3' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
    $ echo "print u'é'" | python
      File "/dev/fd/63", line 1
    SyntaxError: Non-ASCII character '\xc3' in file /dev/fd/63 on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

and in ISO-8859-1 with the -m (module) and -c (command) flags:

    $ python -m test  # where test.py holds the line: print u'é'
    Ã©
    $ python -c "print u'é'"
    Ã©

Where is this documented?

Contrast this with Python 3, which always assumes that the source code is UTF-8 encoded and thus prints é in all four cases.

Note: I tested this on CPython 2.7.14 on both macOS 10.13 and Ubuntu Linux 17.10, with the console encoding set to UTF-8.


python character-encoding ascii python-internals iso-8859-1

Maggyero Feb 26 '18 at 8:28


1 answer

Martijn Pieters · Accepted Answer · 2018-02-26T20:12:00+0000

The -c and -m switches ultimately (*) run the supplied code with the exec statement or the compile() function, both of which assume that byte-string source code is Latin-1 encoded:

The first expression should evaluate to either a Unicode string, a Latin-1 encoded string, an open file object, a code object, or a tuple.
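The Latin-1 assumption accounts for the Ã© output in the question: the terminal sends é as two UTF-8 bytes, and each byte is then read as a separate Latin-1 character. A minimal sketch of that effect, written for Python 3 for convenience (the variable names are illustrative only):

```python
# Sketch (Python 3): reproduce the mojibake by decoding UTF-8 bytes as Latin-1.
source_char = "\u00e9"                       # the character é
utf8_bytes = source_char.encode("utf-8")     # what a UTF-8 terminal sends: b'\xc3\xa9'
misread = utf8_bytes.decode("latin-1")       # each byte becomes one character

# Pure ASCII text is unaffected, because Latin-1 is a superset of ASCII.
ascii_text = "print"
ascii_unchanged = ascii_text.encode("ascii").decode("latin-1") == ascii_text

print(repr(utf8_bytes), misread, ascii_unchanged)
```

Here misread is the two-character string Ã©, which matches what python -c and python -m printed.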

This Latin-1 assumption is not documented; it is an implementation detail that may or may not be considered a bug.

I don't think this is worth fixing, however, and Latin-1 is a superset of ASCII, so little is lost. How code from -c and -m is handled has been cleaned up in Python 3 and is much more consistent there; code passed in with -c is decoded using the current locale, and modules loaded with the -m switch default to UTF-8, as usual.
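To see which encoding Python 3 would pick for a given piece of source, the standard tokenize.detect_encoding() function can be applied to the raw bytes. The following is only a sketch; the helper name declared_encoding and the sample sources are made up for illustration:

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source."""
    # detect_encoding() looks for a BOM or a PEP 263 coding cookie in the
    # first two lines and falls back to UTF-8 if it finds neither.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


no_cookie = b"print('hi')\n"
with_cookie = b"# -*- coding: latin-1 -*-\nprint('hi')\n"

print(declared_encoding(no_cookie))    # utf-8
print(declared_encoding(with_cookie))  # iso-8859-1
```

The second result shows that an explicit PEP 263 declaration still overrides the UTF-8 default.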

(*) If you want to know the exact implementations used, start with the Py_Main() function in Modules/main.c, which handles both -c and -m as:

    if (command) {
        sts = PyRun_SimpleStringFlags(command, &cf) != 0;
        free(command);
    } else if (module) {
        sts = RunModule(module, 1);
        free(module);
    }

-c is executed using the PyRun_SimpleStringFlags() function, which in turn calls PyRun_StringFlags(). When you use exec, a bytestring object is passed to PyRun_StringFlags() too, and the source code is then assumed to contain Latin-1 encoded bytes.
-m uses the RunModule() function to pass the module name to the private function _run_module_as_main() in the runpy module, which uses pkgutil.get_loader() to load the module metadata and fetches the module code object with the loader.get_code() function of the PEP 302 loader; if no cached bytecode is available, the code object is created using the compile() function with the mode set to exec.
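The last step of the -m path, compiling source in exec mode and then running the resulting code object as the main module, can be sketched in a few lines. This is a simplified illustration rather than what runpy actually does internally; the source string, the "<sketch>" filename and the namespace contents are assumptions:

```python
# Sketch: compile source text in "exec" mode, then run the code object
# in a fresh namespace that poses as the __main__ module.
source = "answer = 6 * 7\n"
code_obj = compile(source, "<sketch>", "exec")

namespace = {"__name__": "__main__"}
exec(code_obj, namespace)

print(namespace["answer"])  # 42
```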

