Printing objects and Unicode, what's under the hood? What are some good recommendations?

Question


I am struggling with print and Unicode conversion. Here is some code executed in the Python 2.5 interpreter on Windows.

    >>> import sys
    >>> print sys.stdout.encoding
    cp850
    >>> print u"é"
    é
    >>> print u"é".encode("cp850")
    é
    >>> print u"é".encode("utf8")
    ├®
    >>> print u"é".__repr__()
    u'\xe9'
    >>> class A():
    ...     def __unicode__(self):
    ...         return u"é"
    ...
    >>> print A()
    <__main__.A instance at 0x0000000002AEEA88>
    >>> class B():
    ...     def __repr__(self):
    ...         return u"é".encode("cp850")
    ...
    >>> print B()
    é
    >>> class C():
    ...     def __repr__(self):
    ...         return u"é".encode("utf8")
    ...
    >>> print C()
    ├®
    >>> class D():
    ...     def __str__(self):
    ...         return u"é"
    ...
    >>> print D()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
    >>> class E():
    ...     def __repr__(self):
    ...         return u"é"
    ...
    >>> print E()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

So, when a Unicode string is printed, it is not its __repr__() function that is called and printed.
But when an object is printed, __str__() or __repr__() (if __str__ is not implemented) is called, not __unicode__(). Neither of them can return a Unicode string.
But why? Why, if __repr__() or __str__() returns a Unicode string, is the behavior not the same as when printing a Unicode string directly? In other words: why is print D() different from print D().__str__()?

Did I miss something?

These examples also show that if you want to print an object represented by Unicode strings, you must encode it into a byte string (type str). But for good printing (avoiding "├®"), the right encoding depends on sys.stdout.encoding.
So, do I need to add u"é".encode(sys.stdout.encoding) to each __str__ or __repr__? Or return repr(u"é")? What if I use a pipe? Is the encoding the same as that of sys.stdout?

My main problem is to make a class "printable", i.e. print A() prints something completely readable (not with \x*** Unicode escapes). Here is the bad behavior / code that needs to be changed:

    class User(object):
        name = u"Luiz Inácio Lula da Silva"
        def __repr__(self):
            # returns unicode
            return "<User: %s>" % self.name

            # won't display gracefully
            # expl: print repr(u'é') -> u'\xe9'
            return repr("<User: %s>" % self.name)

            # won't display gracefully
            # expl: print u"é".encode("utf8") -> print '\xc3\xa9' -> ├®
            return ("<User: %s>" % self.name).encode("utf8")

Thanks!


python unicode printing stdout

Thorfin Aug 24 '10 at 13:46


2 answers

I assume your sys.getdefaultencoding() is still "ascii", and I think this is what is used whenever str() or repr() of an object is applied. You can change it with sys.setdefaultencoding(). As soon as you write to a stream, be it STDOUT or a file, you have to comply with its encoding. This also applies to piping in the shell, IMO. I guess that "print" does honor the STDOUT encoding, but the exception happens before "print" is called, when its argument is constructed.


ThomasH Aug 24 '10 at 14:03


Alex Martelli · Accepted Answer · 2010-08-24T14:21:11+0000

Python does not place many semantic type constraints on functions and methods, but it has a few, and here is one of them: __str__ (in Python 2.*) must return a byte string. As usual, if a unicode object is found where a byte string is required, the current default encoding (usually 'ascii') is applied in the attempt to make the required byte string from the unicode object.
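A rough sketch of that implicit step, written so that it also runs on Python 3: the hypothetical helper coerce_to_bytes below mimics what happens to a __str__ result, assuming the default encoding is 'ascii'. It is an illustration of the rule, not the interpreter's actual code.

```python
DEFAULT_ENCODING = "ascii"  # assumed value of sys.getdefaultencoding()


def coerce_to_bytes(value):
    """Hypothetical helper: mimic the implicit unicode -> byte string step."""
    if isinstance(value, bytes):
        return value  # already a byte string, passed through untouched
    return value.encode(DEFAULT_ENCODING)  # may raise UnicodeEncodeError


# Pure-ASCII text converts without trouble
ok = coerce_to_bytes(u"abc")

# A non-ASCII character fails, whatever is done with the result afterwards
try:
    coerce_to_bytes(u"\xe9")
    error = None
except UnicodeEncodeError as exc:
    error = exc
```

The failure here has the same cause as the traceback for print D() in the question: the 'ascii' codec cannot encode u'\xe9'.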

For this operation, the encoding (if any) of any given file object is irrelevant, because what is returned from __str__ may be about to be printed, or it may be about to undergo completely different and unrelated treatment. Your purpose in calling __str__ does not matter to the call itself and its results; Python, in general, does not take into account the "future context" of an operation (what you are going to do with the result after the operation completes) when determining the operation's semantics.

This is because Python does not always know your future intentions, and it tries to minimize the amount of surprise. print str(x) and s = str(x); print s (the same operations performed in one gulp versus two), in particular, must have the same effects. In the second case there will be an exception if str(x) cannot validly produce a byte string (that is, for example, if x.__str__() cannot), and therefore the exception must also occur in the first case.

print itself (since 2.4, I believe), when presented with a unicode object, takes into account the .encoding attribute (if any) of the target stream (by default sys.stdout); other operations, not yet connected to any given target stream, do not, and str(x) (i.e. x.__str__()) is just such an operation.
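To make the difference concrete, here is a small sketch (runnable on Python 2.7 and 3) of the bytes involved for a cp850 console like the one in the question. The variable names are illustrative only.

```python
e_acute = u"\xe9"  # the character used throughout the question

# What gets written for a unicode object when the stream encoding is cp850
cp850_bytes = e_acute.encode("cp850")

# A byte string is written as-is, so UTF-8 bytes reach the console unchanged
utf8_bytes = e_acute.encode("utf8")

# How a cp850 console interprets those two UTF-8 bytes
as_seen_on_console = utf8_bytes.decode("cp850")
```

The two characters in as_seen_on_console are U+251C and U+00AE, which is the "├®" from the question's session.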

I hope this helped show the reason for the behavior that annoys you ...

Edit: the OP now clarifies: "My main problem is to make a class 'printable', i.e. print A() prints something completely readable (not with \x*** Unicode escapes)." Here is the approach that I think is best suited to this specific goal:

    import sys

    DEFAULT_ENCODING = 'UTF-8'  # or whatever you like best

    class sic(object):
        def __unicode__(self):  # the "real thing"
            return u'Pel\xe9'
        def __str__(self):  # tries to "look nice"
            return unicode(self).encode(sys.stdout.encoding or DEFAULT_ENCODING, 'replace')
        def __repr__(self):  # must be unambiguous
            return repr(unicode(self))

That is, this approach focuses on __unicode__ as the primary way for the class's instances to format themselves, but since (in Python 2) print calls __str__ instead, it has __str__ delegate to __unicode__ and do the best it can in terms of encoding. Not perfect, but then the Python 2 print statement is far from perfect anyway ;-).
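The encoding choice made inside __str__ can be isolated in a small function, sketched here so that it runs on both Python 2.7 and 3. encode_for_stream and FakeStream are hypothetical names used only for illustration. In Python 2, sys.stdout.encoding is typically None when output is piped or redirected, which is the case the "or DEFAULT_ENCODING" fallback covers.

```python
DEFAULT_ENCODING = "UTF-8"  # same fallback as in the class above


class FakeStream(object):
    """Hypothetical stand-in for sys.stdout with a chosen encoding."""

    def __init__(self, encoding):
        self.encoding = encoding


def encode_for_stream(text, stream):
    # Fall back to DEFAULT_ENCODING when the stream reports no encoding
    encoding = getattr(stream, "encoding", None) or DEFAULT_ENCODING
    # 'replace' substitutes '?' for characters the encoding cannot represent
    return text.encode(encoding, "replace")


name = u"Pel\xe9"
on_cp850_console = encode_for_stream(name, FakeStream("cp850"))
on_ascii_terminal = encode_for_stream(name, FakeStream("ascii"))
when_piped = encode_for_stream(name, FakeStream(None))
```

With 'replace', a terminal that cannot show the character gets "Pel?" instead of an exception, which is the "tries to look nice" trade-off.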

__repr__, for its part, should strive to be unambiguous, that is, it should not try to "look nice" at the risk of ambiguity (ideally, when feasible, it should return a byte string that, if passed to eval, would produce an instance equal to the present one ... which is far from always feasible, but the lack of ambiguity is the absolute core of the distinction between __str__ and __repr__, and I strongly recommend respecting that distinction!).
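As a minimal sketch of that eval round-trip ideal, assuming a hypothetical Player class that is not part of the question (runnable on Python 2.7 and 3):

```python
class Player(object):
    """Hypothetical class used only to illustrate an eval-friendly repr."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Player) and self.name == other.name

    def __repr__(self):
        # %r delegates to repr() of the name, which quotes it and
        # escapes special characters
        return "Player(%r)" % (self.name,)


original = Player(u"Pel\xe9")
rebuilt = eval(repr(original))  # rebuild an instance from its repr
round_trips = (rebuilt == original)
```

Under Python 2 the repr of the unicode name uses \x escapes, so the result is plain ASCII and prints the same on any console, at the cost of not looking nice.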
