Reading a UTF-8 file with a BOM using the Python CSV module causes unwanted extra characters

I am trying to read a CSV file with Python with the following code:

 import csv

 with open("example.txt") as f:
     c = csv.reader(f)
     for row in c:
         print row

My example.txt has only the following content:

 Hello world!

For files encoded with UTF-8 or ANSI, this gives me the expected result:

 ['Hello world!']

But if I save the file as UTF-8 with BOM, I get this output:

 ['\xef\xbb\xbfHello world!']
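(Those three extra bytes are the UTF-8 encoding of the Unicode byte order mark U+FEFF; decoding with the `utf-8-sig` codec strips them. A quick sketch, in Python 3 syntax:)

 import codecs

 # The unwanted prefix is exactly the UTF-8-encoded BOM.
 raw = b'\xef\xbb\xbfHello world!'
 print(raw[:3] == codecs.BOM_UTF8)  # True

 # The 'utf-8-sig' codec removes a leading BOM if one is present.
 print(raw.decode('utf-8-sig'))  # Hello world!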

Since I have no control over the files that the user will use as input, I would like this to work with the BOM. How can I fix this problem? Is there something I need to do to make sure this works for other encodings too?

1 answer

You can use the Python unicodecsv module as follows:

 import unicodecsv

 with open('input.csv', 'rb') as f_input:
     csv_reader = unicodecsv.reader(f_input, encoding='utf-8-sig')
     print list(csv_reader)

So, for an input file containing the following, saved as UTF-8 with a BOM:

 c1,c2,c3,c4,c5,c6,c7,c8
 1,2,3,4,5,6,7,8

This will display the following:

 [[u'c1', u'c2', u'c3', u'c4', u'c5', u'c6', u'c7', u'c8'], [u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8']] 

The unicodecsv module can be installed using pip as follows:

 pip install unicodecsv 
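Alternatively, if Python 3 is an option, the standard-library `csv` module handles this on its own: open the file with `encoding='utf-8-sig'` and the BOM is stripped transparently. A minimal sketch (using an in-memory buffer in place of a real file):

 import csv
 import io

 # Simulate a file saved as UTF-8 with a BOM.
 data = b'\xef\xbb\xbfc1,c2\r\n1,2\r\n'

 # encoding='utf-8-sig' strips the BOM; plain 'utf-8' would leave
 # '\ufeff' attached to the first field of the first row.
 f = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8-sig', newline='')
 rows = list(csv.reader(f))
 print(rows)  # [['c1', 'c2'], ['1', '2']]

The same `encoding='utf-8-sig'` argument works with `open()` on a real file, and it is harmless for files without a BOM, so it is a safe default when you don't control the input.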
