Reading a CSV file into a pandas DataFrame with invalid characters (accents)

I am trying to read a CSV file into a pandas DataFrame, but the CSV contains accented characters. I am using Python 2.7.

I ran into a UnicodeDecodeError because there is an accent in the first column. I read a number of resources, including this SO question about UTF-8 in CSV files, this blog post about CSV errors, and this blog post about UTF-8 in Python 2.7.

I used the answers I found there to try to modify my code. I initially had:

    import pandas as pd

    # Create a dataframe with the data we are interested in
    df = pd.DataFrame.from_csv('MYDATA.csv')
    mode = lambda ts: ts.value_counts(sort=True).index[0]
    cols = df['CompanyName'].value_counts().index
    df['Calls'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)

And so on. This worked, but switching to "NÍ" and "Nê" as the client name now gives an error:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 7: invalid continuation byte 

I tried changing the line to df = pd.read_csv('MYDATA.csv', encoding='utf-8'), but this gives the same error.
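Since the client names contain characters like 'ê', and 0xea happens to be the Latin-1 byte for 'ê', I suspect the file may not actually be UTF-8 at all. A minimal sketch of what I could try, assuming the file is really Latin-1/cp1252 encoded (which I have not verified):

    import pandas as pd

    # Assumption: MYDATA.csv may be Latin-1/cp1252 rather than UTF-8;
    # 0xEA decodes to 'ê' in Latin-1, which would match the accented names.
    df = pd.read_csv('MYDATA.csv', encoding='latin-1')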

So I tried the following, based on suggestions I found while researching, but it also does not work and gives the same error.

    import pandas as pd
    import csv

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    reader = unicode_csv_reader(open('MYDATA.csv', 'rU'), dialect=csv.excel)

    # Create a dataframe with the data we are interested in
    df = pd.DataFrame(reader)

It seems to me that it should not be this difficult to read CSV data into a pandas DataFrame. Does anyone know an easier way?

Edit: Strangely, even if I delete the row with the accented characters, I still get the error:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 960: invalid continuation byte

This is strange, since my test CSV only has 19 rows and 27 columns. I am hoping that decoding the whole CSV as UTF-8 will fix the problem.
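For reference, here is a small sketch of how I could inspect the raw bytes around the reported offset to see what is actually there (assuming the file is MYDATA.csv and using the position 960 from the traceback):

    # Read the file as raw bytes so no decoding happens yet.
    with open('MYDATA.csv', 'rb') as f:
        raw = f.read()

    offset = 960  # position reported in the UnicodeDecodeError
    context = raw[max(0, offset - 20):offset + 20]
    print repr(context)          # show the surrounding raw bytes
    print hex(ord(raw[offset]))  # the exact byte the decoder choked on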

2 answers

Try adding this to the top of the script:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

I know it is very annoying to run into an error in read_csv. You can try this:

    df = pd.read_csv(filename, sep=',', error_bad_lines=False)

It skips the bad lines, which can save a lot of time.

