Reading a UTF-8 file with a BOM using the Python CSV module causes unwanted extra characters

I am trying to read a CSV file with Python with the following code:

 import csv

 with open("example.txt") as f:
     c = csv.reader(f)
     for row in c:
         print row

My example.txt has only the following content:

 Hello world!

For files encoded with UTF-8 or ANSI, this gives me the expected result:

 ['Hello world!']

But if I save the file as UTF-8 with BOM, I get this output:

 ['\xef\xbb\xbfHello world!']
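(Those three extra bytes are the UTF-8 encoding of the Unicode byte order mark U+FEFF; decoding with the `utf-8-sig` codec strips them. A quick sketch, in Python 3 syntax:)

 import codecs

 # The unwanted prefix is exactly the UTF-8-encoded BOM.
 raw = b'\xef\xbb\xbfHello world!'
 print(raw[:3] == codecs.BOM_UTF8)  # True

 # The 'utf-8-sig' codec removes a leading BOM if one is present.
 print(raw.decode('utf-8-sig'))  # Hello world!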

Since I have no control over the files that the user will use as input, I would like this to work with the BOM. How can I fix this problem? Is there something I need to do to make sure this works for other encodings too?

1 answer

You can use the Python unicodecsv module as follows:

 import unicodecsv

 with open('input.csv', 'rb') as f_input:
     csv_reader = unicodecsv.reader(f_input, encoding='utf-8-sig')
     print list(csv_reader)

So, for an input file containing the following, saved as UTF-8 with a BOM:

 c1,c2,c3,c4,c5,c6,c7,c8
 1,2,3,4,5,6,7,8

This will display the following:

 [[u'c1', u'c2', u'c3', u'c4', u'c5', u'c6', u'c7', u'c8'], [u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8']] 

The unicodecsv module can be installed using pip as follows:

 pip install unicodecsv 
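Alternatively, if Python 3 is an option, the standard-library `csv` module handles this on its own: open the file with `encoding='utf-8-sig'` and the BOM is stripped transparently. A minimal sketch (using an in-memory buffer in place of a real file):

 import csv
 import io

 # Simulate a file saved as UTF-8 with a BOM.
 data = b'\xef\xbb\xbfc1,c2\r\n1,2\r\n'

 # encoding='utf-8-sig' strips the BOM; plain 'utf-8' would leave
 # '\ufeff' attached to the first field of the first row.
 f = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8-sig', newline='')
 rows = list(csv.reader(f))
 print(rows)  # [['c1', 'c2'], ['1', '2']]

The same `encoding='utf-8-sig'` argument works with `open()` on a real file, and it is harmless for files without a BOM, so it is a safe default when you don't control the input.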
