Python string encoding issue

I am trying to parse a CSV file containing mostly numeric data, but some lines are in Hebrew and I don't know their encoding.

In the end, I need to know the encoding so that I can convert the strings to unicode, print them, and possibly store them in a database later.

I tried using chardet, which claims that the lines are Windows-1255 (cp1255), but trying print someString.decode('cp1255') gives the notorious error:

 UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128) 

I tried every other encoding, to no avail. Moreover, the file itself is fine, since I can open the CSV in Excel and see the correct data.

Any idea how I can decode these strings correctly?


EDIT: Here is an example. One of the lines looks like this (the first five letters of the Hebrew alphabet):

 print repr(sampleString) #prints: '\xe0\xe1\xe2\xe3\xe4' 

(using Python 2.6.2)

+3
4 answers

This is what happens:

  • sampleString is a byte string (cp1255-encoded)
  • sampleString.decode("cp1255") decodes the byte string into a unicode string (decode == bytes -> unicode string)
  • print sampleString.decode("cp1255") tries to print that unicode string to stdout. To do this, print must encode the unicode string (encode == unicode string -> bytes). The error you see means the print statement cannot encode the given unicode string in the console's encoding. sys.stdout.encoding is the encoding used.

So the problem is that your console does not support these characters. You should be able to configure the console to use a different encoding; how to do this depends on your OS and terminal program.
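In Python 3 the same two steps are explicit and easier to see; a minimal sketch using the sample bytes from the question (the expected byte values are what cp1255 and utf-8 define for the first five Hebrew letters):

```python
# Python 3 sketch of the same pipeline: decode bytes -> text, encode text -> bytes.
raw = b'\xe0\xe1\xe2\xe3\xe4'        # the cp1255 bytes from the question

text = raw.decode('cp1255')          # bytes -> str (first five Hebrew letters)
assert text == '\u05d0\u05d1\u05d2\u05d3\u05d4'

# print() re-encodes using sys.stdout.encoding; doing the same step by hand:
utf8_bytes = text.encode('utf-8')    # str -> bytes in the terminal's encoding
assert utf8_bytes == b'\xd7\x90\xd7\x91\xd7\x92\xd7\x93\xd7\x94'
```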

Another approach is to encode to the terminal's encoding explicitly:

 print sampleString.decode("cp1255").encode("utf-8") 


A simple test program with which you can experiment:

 import sys
 print sys.stdout.encoding
 samplestring = '\xe0\xe1\xe2\xe3\xe4'
 print samplestring.decode("cp1255").encode(sys.argv[1])

On my utf-8 terminal:

 $ python2.6 test.py utf-8
 UTF-8
 אבגדה
 $ python2.6 test.py latin1
 UTF-8
 Traceback (most recent call last):
 UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
 $ python2.6 test.py ascii
 UTF-8
 Traceback (most recent call last):
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
 $ python2.6 test.py cp424
 UTF-8
 ABCDE
 $ python2.6 test.py iso8859_8
 UTF-8
 (unprintable on a utf-8 terminal)

The error messages for latin1 and ascii mean that the Hebrew characters in the string cannot be represented in those encodings.

Pay attention to the last two. I encode the unicode string to the cp424 and iso8859_8 encodings (two of the codecs listed at http://docs.python.org/library/codecs.html#standard-encodings that support Hebrew characters). I do not get any exceptions with these encodings, since the Hebrew characters have a representation in both.

But my utf-8 terminal gets very confused when it receives bytes in an encoding other than utf-8.

In the first case (cp424), my utf-8 terminal displays ABCDE, because the cp424 bytes for the Hebrew letters happen to be the bytes for ABCDE in utf-8: byte 65 (0x41) means A in utf-8 and א in cp424.
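You can probe which codecs can represent the decoded string at all. A Python 3 sketch (codec names taken from the standard-encodings list linked above):

```python
# Probe several codecs: which ones can represent the Hebrew string?
text = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')

for codec in ('ascii', 'latin1', 'cp424', 'iso8859_8', 'utf-8'):
    try:
        print(codec, '->', text.encode(codec))
    except UnicodeEncodeError:
        print(codec, '-> cannot represent this string')
```

cp424 yields b'ABCDE', which is exactly why the utf-8 terminal above shows ABCDE.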

The encode method has an optional errors argument, which you can use to specify what should happen when the encoding cannot represent a character (see the documentation). The supported strategies are strict (the default), ignore, replace, xmlcharrefreplace and backslashreplace. You can even add your own custom strategies.

Another test program (I print quotes around the string to better show how ignore behaves):

 import sys
 samplestring = '\xe0\xe1\xe2\xe3\xe4'
 print "'{0}'".format(samplestring.decode("cp1255").encode(sys.argv[1], sys.argv[2]))

Results:

 $ python2.6 test.py latin1 strict
 Traceback (most recent call last):
   File "test.py", line 4, in <module>
     sys.argv[2]))
 UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
 $ python2.6 test.py latin1 ignore
 ''
 $ python2.6 test.py latin1 replace
 '?????'
 $ python2.6 test.py latin1 xmlcharrefreplace
 '&#1488;&#1489;&#1490;&#1491;&#1492;'
 $ python2.6 test.py latin1 backslashreplace
 '\u05d0\u05d1\u05d2\u05d3\u05d4'
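The same strategies exist on str.encode in Python 3; a small sketch reproducing the results above (as byte strings rather than printed text):

```python
# Python 3 version of the same experiment: the errors= argument of
# str.encode() accepts the strategies shown above.
text = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')   # the five Hebrew letters

print(text.encode('latin1', 'ignore'))            # all five characters dropped
print(text.encode('latin1', 'replace'))           # question marks instead
print(text.encode('latin1', 'xmlcharrefreplace')) # XML character references
print(text.encode('latin1', 'backslashreplace'))  # \uXXXX escapes
```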
+12

When you decode a string to unicode with someString.decode('cp1255'), you get an abstract unicode representation of some Hebrew text. (That part succeeds!) When you use print, you need a concrete encoded representation in a specific encoding. So your problem is not with decoding, but with print.

To print, simply use print someString if your terminal understands cp1255, or print someString.decode('cp1255').encode('the_encoding_your_terminal_does_understand'). If you do not need the output to be readable as Hebrew, print repr(someString.decode('cp1255')) also gives you a meaningful representation of the abstract unicode string.
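A hedged Python 3 sketch of the same advice: take the target encoding from the output stream itself, and use an error handler so the result is printable even on terminals that cannot represent Hebrew:

```python
import sys

# Decode once, then encode for whatever the terminal declares.
text = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
target = sys.stdout.encoding or 'utf-8'   # may be None when output is piped

# backslashreplace guarantees the result is always encodable
printable = text.encode(target, 'backslashreplace').decode(target)
print(printable)
print(repr(text))   # the unambiguous fallback mentioned above
```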

+3

Could someString be a unicode string rather than a plain byte string, unlike the sampleString you showed?

 >>> print '\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
 <hebrew characters>
 >>> print u'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "[...]/encodings/cp1255.py", line 15, in decode
     return codecs.charmap_decode(input,errors,decoding_table)
 UnicodeEncodeError: 'ascii' codec can't encode characters [...]
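In Python 3 this ambiguity disappears: only bytes has .decode(), and calling it on a str raises AttributeError instead of silently encoding with ascii first. A small guard in that spirit (ensure_text is a hypothetical helper name, not a standard function):

```python
def ensure_text(value, encoding='cp1255'):
    """Return value as str, decoding it only if it is still raw bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(ensure_text(b'\xe0\xe1\xe2\xe3\xe4'))  # decoded from cp1255
print(ensure_text('\u05d0\u05d1'))           # already text, returned as-is
```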
0

The error appears while encoding during print, so the string most likely decodes fine; you simply cannot print the result correctly. If you are on Windows, try running chcp 65001 on the command line before running the Python code.

0
