Reading a CSV file into a pandas DataFrame with invalid characters (accents)

I am trying to read a CSV file into a pandas DataFrame, but the CSV contains accented characters. I am using Python 2.7.

I ran into a UnicodeDecodeError because there is an accent in the first column. I read a number of resources, including this SO question about UTF-8 in CSV files, this blog post about CSV errors, and this blog post about UTF-8 in Python 2.7.

I used the answers I found there to try to modify my code. I initially had:

    import pandas as pd

    # Create a dataframe with the data we are interested in
    df = pd.DataFrame.from_csv('MYDATA.csv')
    mode = lambda ts: ts.value_counts(sort=True).index[0]
    cols = df['CompanyName'].value_counts().index
    df['Calls'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)

And so on. This worked, but switching to "NÍ" and "Nê" as the client name now gives an error:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 7: invalid continuation byte 

I tried changing the line to df = pd.read_csv('MYDATA.csv', encoding='utf-8'), but this gives the same error.
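Since the client names contain characters like 'ê', and 0xea happens to be the Latin-1 byte for 'ê', I suspect the file may not actually be UTF-8 at all. A minimal sketch of what I could try, assuming the file is really Latin-1/cp1252 encoded (which I have not verified):

    import pandas as pd

    # Assumption: MYDATA.csv may be Latin-1/cp1252 rather than UTF-8;
    # 0xEA decodes to 'ê' in Latin-1, which would match the accented names.
    df = pd.read_csv('MYDATA.csv', encoding='latin-1')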

So I tried the following, based on suggestions I found while researching, but it also does not work and gives the same error.

    import pandas as pd
    import csv

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    reader = unicode_csv_reader(open('MYDATA.csv', 'rU'), dialect=csv.excel)

    # Create a dataframe with the data we are interested in
    df = pd.DataFrame(reader)

It seems to me that it should not be this difficult to read CSV data into a pandas DataFrame. Does anyone know an easier way?

Edit: Strangely, even if I delete the row with the accented characters, I still get the error:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 960: invalid continuation byte

This is strange, since my test CSV only has 19 rows and 27 columns. I am hoping that decoding the whole CSV as UTF-8 will fix the problem.
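For reference, here is a small sketch of how I could inspect the raw bytes around the reported offset to see what is actually there (assuming the file is MYDATA.csv and using the position 960 from the traceback):

    # Read the file as raw bytes so no decoding happens yet.
    with open('MYDATA.csv', 'rb') as f:
        raw = f.read()

    offset = 960  # position reported in the UnicodeDecodeError
    context = raw[max(0, offset - 20):offset + 20]
    print repr(context)          # show the surrounding raw bytes
    print hex(ord(raw[offset]))  # the exact byte the decoder choked on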

2 answers

Try adding this to the top of the script:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

I know it is very annoying to run into an error in read_csv. You can try this:

    df = pd.read_csv(filename, sep=',', error_bad_lines=False)

It skips the bad lines, which can save a lot of time.

