This is what happens:
- sampleString is a byte string (cp1255 encoded).
- sampleString.decode("cp1255") decodes the byte string to a unicode string (decode == bytes -> unicode string).
- print sampleString.decode("cp1255") tries to print a unicode string to stdout. To do this, print must encode the unicode string (encode == unicode string -> bytes).
- The error you see means that the Python print statement cannot encode the given unicode string in the console encoding. sys.stdout.encoding is the console (terminal) encoding.
So the problem is that your console encoding does not support these characters. You should be able to configure the console to use a different encoding. How to do this depends on your OS and terminal program.
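The steps above can be sketched as a small check. This is only an illustration, not part of the original test programs: the helper name can_encode is made up, and the bytes literal plus print function are used so that the sketch runs on Python 2.6+ and Python 3 alike.

```python
from __future__ import print_function  # no effect on Python 3
import sys

# cp1255-encoded bytes, as in the question
sample_bytes = b'\xe0\xe1\xe2\xe3\xe4'

# decode: bytes -> unicode string
text = sample_bytes.decode("cp1255")


def can_encode(text, encoding):
    """Hypothetical helper: True if `encoding` can represent all of `text`."""
    try:
        text.encode(encoding)  # encode: unicode string -> bytes
        return True
    except UnicodeEncodeError:
        return False


# sys.stdout.encoding may be None when output is piped; assume ASCII then.
console_encoding = sys.stdout.encoding or "ascii"
print("console encoding:", console_encoding)
print("console can represent the text:", can_encode(text, console_encoding))
```

If the last line reports False, printing the unicode string directly is what raises the UnicodeEncodeError.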
Another approach would be to manually specify the encoding used:
    print sampleString.decode("cp1255").encode("utf-8")
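The same decode-then-encode step can be wrapped in a small function. This is a minimal sketch: the name transcode is an assumption for illustration, and the code is written so that it also runs on Python 3.

```python
from __future__ import print_function  # no effect on Python 3


def transcode(data, source_encoding, target_encoding):
    """Hypothetical helper: re-encode bytes by going through a unicode string."""
    return data.decode(source_encoding).encode(target_encoding)


sample_bytes = b'\xe0\xe1\xe2\xe3\xe4'
utf8_bytes = transcode(sample_bytes, "cp1255", "utf-8")

# Each Hebrew letter takes two bytes in UTF-8, so five letters become ten bytes.
print(len(utf8_bytes), repr(utf8_bytes))
```

Writing these bytes to the console only looks right if the console really expects UTF-8.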
A simple test program with which you can experiment:
    import sys
    print sys.stdout.encoding
    samplestring = '\xe0\xe1\xe2\xe3\xe4'
    print samplestring.decode("cp1255").encode(sys.argv[1])
On my UTF-8 terminal:
    $ python2.6 test.py utf-8
    UTF-8
    אבגדה
    $ python2.6 test.py latin1
    UTF-8
    Traceback (most recent call last):
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
    $ python2.6 test.py ascii
    UTF-8
    Traceback (most recent call last):
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
    $ python2.6 test.py cp424
    UTF-8
    ABCDE
    $ python2.6 test.py iso8859_8
    UTF-8
The error messages for latin1 and ascii mean that the Unicode characters in the string cannot be represented in these encodings.
Pay attention to the last two. There I encode the Unicode string to cp424 and iso8859_8 (two of the encodings listed at http://docs.python.org/library/codecs.html#standard-encodings that support Hebrew characters). I do not get any exceptions with these encodings, since the Hebrew Unicode characters have a representation in both of them.
But my utf-8 terminal is very confused when it receives bytes in a different encoding than utf-8.
In the first case (cp424), my UTF-8 terminal displays ABCDE, which means that the UTF-8 representation of A is the same byte as the cp424 representation of א, i.e. byte 65 means A in UTF-8 and א in cp424.
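To see this at the byte level, the sketch below encodes the same five Hebrew letters to cp424 and iso8859_8 and inspects the resulting bytes. It is an illustration only, with variable names chosen here, and it is written to run on both Python 2.6+ and Python 3.

```python
from __future__ import print_function  # no effect on Python 3

text = u'\u05d0\u05d1\u05d2\u05d3\u05d4'  # alef, bet, gimel, dalet, he

cp424_bytes = text.encode("cp424")
iso_bytes = text.encode("iso8859_8")

# bytearray yields integers on both Python 2 and Python 3
print([hex(b) for b in bytearray(cp424_bytes)])
print([hex(b) for b in bytearray(iso_bytes)])

# The cp424 bytes happen to be valid ASCII, so an ASCII-compatible
# terminal shows them as Latin letters instead of Hebrew.
print(cp424_bytes.decode("ascii"))

# The iso8859_8 bytes are not valid UTF-8; decoding them with "replace"
# shows roughly what a UTF-8 terminal has to work with.
print(repr(iso_bytes.decode("utf-8", "replace")))
```

Neither encoding raises an exception, but neither produces bytes that a UTF-8 terminal can display as Hebrew.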
The encode method has an optional errors argument, a string which you can use to indicate what should happen when the encoding cannot represent a character. The supported strategies are strict (the default), ignore, replace, xmlcharrefreplace and backslashreplace. You can even add your own custom strategies with codecs.register_error.
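As a rough sketch of a custom strategy, the handler below replaces every unencodable character with an underscore. The handler name "underscore" and the function underscore_errors are made up for this example; codecs.register_error is the standard library call that registers them. The code is written to run on both Python 2.6+ and Python 3.

```python
from __future__ import print_function  # no effect on Python 3
import codecs

text = u'\u05d0\u05d1\u05d2\u05d3\u05d4'


def underscore_errors(exc):
    """Hypothetical error handler: one underscore per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    # Return the replacement text and the position where encoding resumes.
    return (u"_" * (exc.end - exc.start), exc.end)


# Register the handler under a name usable as the errors argument.
codecs.register_error("underscore", underscore_errors)

print(repr(text.encode("latin-1", "underscore")))
```

Once registered, the name can be passed wherever strict, ignore or replace would be accepted.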
Another test program (I print quotes around the string to better show how ignore behaves):
    import sys
    samplestring = '\xe0\xe1\xe2\xe3\xe4'
    print "'{0}'".format(samplestring.decode("cp1255").encode(sys.argv[1],
          sys.argv[2]))
Results:
    $ python2.6 test.py latin1 strict
    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        sys.argv[2]))
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
    [/tmp] $ python2.6 test.py latin1 ignore
    ''
    [/tmp] $ python2.6 test.py latin1 replace
    '?????'
    [/tmp] $ python2.6 test.py latin1 xmlcharrefreplace
    '&