When I write sysadmin scripts in Python, the buffering on sys.stdout, which affects every print() call, is annoying: I don't want to wait for the buffer to be flushed and then get a big chunk of lines on the screen all at once. Instead, I want each piece of output to appear as soon as the script produces it. I don't even want to wait for a newline before I see the output.
A commonly used idiom for this in Python is:
import os
import sys
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
This has worked great for me for a long time. Now I noticed that it does not work with Unicode. See the following script:
#!/usr/bin/python
This produces the following output:
Original encoding: UTF-8
New encoding: None
<type 'str'>
Eisb▒r
<type 'unicode'>
Traceback (most recent call last):
  File "./export_debug.py", line 18, in <module>
    print(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 4: ordinal not in range(128)
It took me a few hours to track down the cause (my original script was much longer than this minimal debugging script). It is the line
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
which I have used for years, so I did not expect any problems with it. With this line commented out, the output is correct:
Original encoding: UTF-8
New encoding: UTF-8
<type 'str'>
Eisb▒r
<type 'unicode'>
Eisbär
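The failing step can be reproduced in isolation, without touching sys.stdout. This is only a sketch: it assumes that Python 2 falls back to the ASCII codec when the output stream reports no encoding, which is what the traceback above suggests. The variable names are illustrative and not taken from the original script.

```python
# Sketch: reproduce the encode step that fails in the traceback.
# Assumption: with sys.stdout.encoding set to None, Unicode output is
# encoded with the ASCII codec.
text = b'Eisb\xe4r'.decode('latin-1')  # Unicode string containing U+00E4

try:
    text.encode('ascii')
    failure_info = None
except UnicodeEncodeError as exc:
    # Keep the codec name, the offending position and the reason.
    failure_info = (exc.encoding, exc.start, exc.reason)
```

The captured values correspond to the 'ascii' codec and position 4 reported in the traceback.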
So what is the script supposed to do? To keep my Python 2.7 code as close to Python 3.x as possible, I always use
from __future__ import print_function, unicode_literals
which makes Python use the new print() function and, more importantly, turns all string literals into Unicode strings by default. I have a lot of Latin-1 / ISO-8859-1 encoded data, e.g.
text = b'Eisb\xe4r'
To work with it as intended, I first need to decode it to Unicode, which is what
text = text.decode('latin-1')
is for. Since the default encoding on my system is UTF-8, whenever I print a string, Python then encodes the internal Unicode string to UTF-8. But internally it should be proper Unicode first.
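The two conversion steps described above can be written out explicitly. This is a minimal sketch; the names latin1_bytes and utf8_bytes are illustrative, and UTF-8 is assumed as the output encoding as on my system.

```python
# Sketch: Latin-1 bytes -> Unicode string -> UTF-8 bytes.
latin1_bytes = b'Eisb\xe4r'             # 6 bytes, 0xE4 is "ä" in Latin-1

text = latin1_bytes.decode('latin-1')   # 6 Unicode characters
utf8_bytes = text.encode('utf-8')       # "ä" becomes the two bytes C3 A4
```

The decoded string has six characters, while its UTF-8 form has seven bytes, which is why the raw Latin-1 byte shows up as ▒ on a UTF-8 terminal.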
All of this works fine, just not with a zero-byte output buffer. Any ideas? I noticed that sys.stdout.encoding is unset (None) after switching to unbuffered output, but I don't know how to set it again. It is a read-only attribute, and the LC_ALL and LC_CTYPE environment variables are apparently only evaluated when the Python interpreter starts.
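One possible direction, sketched under assumptions: since the encoding attribute cannot be reassigned, the unbuffered binary stream could be wrapped in a writer that does the encoding itself. The sketch uses codecs.getwriter from the standard library. The helper names unbuffered_text_writer and install_unbuffered_stdout are hypothetical, and UTF-8 is hard-coded as an assumption rather than detected from the locale.

```python
import codecs
import io
import os
import sys


def unbuffered_text_writer(binary_stream, encoding='utf-8'):
    # Hypothetical helper: wrap a binary stream in a codecs StreamWriter
    # so that Unicode text is encoded before it is written.
    return codecs.getwriter(encoding)(binary_stream)


def install_unbuffered_stdout(encoding='utf-8'):
    # Hypothetical helper: replace sys.stdout with an unbuffered stream
    # that still accepts Unicode. The encoding is an assumption.
    raw_stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
    sys.stdout = unbuffered_text_writer(raw_stdout, encoding)


# Demonstration against an in-memory buffer instead of the real stdout.
demo_buffer = io.BytesIO()
demo_writer = unbuffered_text_writer(demo_buffer)
demo_writer.write(b'Eisb\xe4r'.decode('latin-1') + u'\n')
demo_output = demo_buffer.getvalue()
```

Calling install_unbuffered_stdout() once at startup would then let print(text) handle Unicode strings. A likely caveat under Python 2: a byte string with non-ASCII bytes written through such a writer is implicitly decoded as ASCII first and would fail, so only Unicode strings should go through it.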
Btw: "Eisbär" is the German word for "polar bear".