Problem with python: unicode

Question

I am trying to decode a string taken from a file:

    file = open("./Downloads/lamp-post.csv", 'r')
    data = file.readlines()
    data[0]

    '\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00s\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'

Adding "ignore" doesn't really help:

    In [69]: data[2]
    Out[69]: u'\u6700\u6100\u700\u6400\u6e00\u2000\u6c00\u600\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u343\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'

    In [70]: data[2].decode("utf-8", "replace")
    ---------------------------------------------------------------------------
    Traceback (most recent call last)

    /Users/oleg/ in ()

    /opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
         14
         15 def decode(input, errors='strict'):
    ---> 16     return codecs.utf_8_decode(input, errors, True)
         17
         18 class IncrementalEncoder(codecs.IncrementalEncoder):

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)

    In [71]:

+8

python unicode

Oleg Tarasenko Jan 19 '11 at 13:06

3 answers

This is a UTF-16-LE encoded file with an initial BOM (byte order mark, the leading '\xff\xfe').

    import codecs
    fp = codecs.open("a", "r", "utf-16")
    lines = fp.readlines()
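For anyone who wants to try this without the original CSV, here is a minimal sketch. It makes a few assumptions: it is written in Python 3 syntax (the question uses Python 2), the sample rows and the temporary file are made up, and it uses `io.open` with `encoding="utf-16"` rather than `codecs.open`, which should behave the same way for this purpose.

```python
import io
import os
import tempfile

# Made-up sample standing in for the first rows of the CSV.
sample = u"Keyword\tCompetition\ngarden lamp post\t0.94\n"

# Store it the way the question's file appears to be stored:
# a little-endian BOM (b"\xff\xfe") followed by UTF-16-LE bytes.
raw = b"\xff\xfe" + sample.encode("utf-16-le")

fd, path = tempfile.mkstemp(suffix=".csv")
try:
    with os.fdopen(fd, "wb") as out:
        out.write(raw)

    # The generic "utf-16" codec reads the BOM to pick the byte order
    # and leaves it out of the decoded text.
    with io.open(path, "r", encoding="utf-16") as fp:
        lines = fp.readlines()
finally:
    os.remove(path)

print(lines)
```

Because the decoding happens before the text is split into lines, each element of `lines` is a clean text string with no stray NUL characters.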

+6

tzot Feb 13 '11 at 11:42

EDIT

Since you mentioned 2.7, here is a 2.7 solution:

    file = open("./Downloads/lamp-post.csv", "r")
    data = [line.decode("utf-16", "replace") for line in file]

Ignoring non-convertible characters:

    file = open("./Downloads/lamp-post.csv", "r")
    data = [line.decode("utf-16", "ignore") for line in file]
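A small sketch of what the two error handlers do with a line that was cut in the middle of a two-byte character. Assumptions: Python 3 syntax rather than the Python 2 of the question, and `first_line` is a made-up stand-in for what reading the file line by line in byte mode would return for the header row. It suggests the handlers only hide the error, they do not repair the data.

```python
# Hypothetical first line of a UTF-16-LE file as returned by a byte-mode
# line read: BOM, two-byte characters, then a lone b"\n" whose trailing
# b"\x00" was left behind for the next line.
first_line = b"\xff\xfe" + u"Keyword".encode("utf-16-le") + b"\n"

try:
    first_line.decode("utf-16")  # default errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True  # odd byte count: the last code unit is incomplete

# "replace" substitutes U+FFFD for the incomplete unit, "ignore" drops it.
replaced = first_line.decode("utf-16", "replace")
ignored = first_line.decode("utf-16", "ignore")

print(strict_failed)
print(repr(replaced))
print(repr(ignored))
```

Either way the newline itself is lost, and the following line starts one byte off.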

+3

orlp Jan 19 '11 at 13:08

Sven Marnach · Accepted Answer · 2011-01-19T13:10:53+0000

This looks like UTF-16 data, so try

 data[0].rstrip("\n").decode("utf-16")

Edit (for your update): Try decoding the whole file at once, i.e.

    data = open(...).read()
    data.decode("utf-16")

The problem is that a line break in (little-endian) UTF-16 is the two bytes "\n\x00", but readlines() splits right after the "\n", leaving the "\x00" at the start of the next line. From that point on, every pair of bytes is shifted by one.
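The effect can be reproduced on a few bytes. This is only a sketch under stated assumptions: Python 3 syntax instead of the Python 2 used in the question, made-up content in `text`, and `io.BytesIO` standing in for the file opened in byte mode.

```python
import io

text = u"garden\nlamp\n"  # made-up content
raw = b"\xff\xfe" + text.encode("utf-16-le")

# Reading lines in byte mode splits right after every b"\n" byte.
byte_lines = io.BytesIO(raw).readlines()

# The second piece starts with the b"\x00" that belonged to the newline,
# so each character's two bytes are paired the wrong way round.
shifted = byte_lines[1].decode("utf-16-le")

# Decoding everything first and splitting afterwards keeps the pairs intact.
good_lines = raw.decode("utf-16").splitlines()

print(repr(byte_lines[1]))
print(repr(shifted))
print(good_lines)
```

The shifted result contains code points such as U+6C00 and U+0A00, the same pattern as the `\u6c00 ... \u0a00` values in the question's `Out[69]`.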
