Python pandas - trailing delimiter confuses read_csv

I am following examples from a Python book for data analysis, in particular the 2012 election database from its chapter 9. The data is in a large comma-separated CSV file, but each line of the file has an additional trailing delimiter, which seems to confuse pandas.read_csv.

It treats the extra delimiter as if there were an extra column, so there is one more column of data than there are headers. pandas.read_csv then takes the first column as row labels. The overall effect is that the columns and headers no longer align: the first column becomes the row labels, the second column gets the first header, and so on.

This is pretty annoying. Any idea how to tell pandas.read_csv to do the right thing? I could not find an option for it.

Great book, BTW.

+6
3 answers

I created a GitHub issue to look at handling this case automatically:

https://github.com/pydata/pandas/issues/2442

I think the FEC file format has changed a bit, causing this annoying problem. If you use the version of the file posted at http://github.com/pydata/pydata-book, you will hopefully not have this problem.

+2

For anyone who still finds this: Wes wrote a blog post about it. The problem is that if a row has one value too many, its first value is treated as the row's name (its index label).

This behavior can be changed by passing index_col=False to read_csv.
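As a rough illustration, here is a minimal sketch of the difference. The three-column sample data is made up for the example (it is not the FEC file), and it is read from an in-memory string so the snippet runs on its own:

```python
import io

import pandas as pd

# Made-up sample: three header names, but every data row ends with an
# extra trailing comma, so each row has four fields.
raw = "a,b,c\n1,2,3,\n4,5,6,\n"

# Default behaviour: one more data field than header names, so pandas
# uses the first field of each row as the row label and the headers
# end up shifted one column to the left.
shifted = pd.read_csv(io.StringIO(raw))

# index_col=False tells read_csv not to use the first column as the
# index, so the headers line up with the data again.
aligned = pd.read_csv(io.StringIO(raw), index_col=False)

print(shifted)
print(aligned)
```

Depending on the pandas version, the second call may emit a warning about the header length not matching the data; the resulting frame should still have the three expected columns.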

+6

Well, there is a very simple workaround. Add a dummy column to the header when reading the csv file in:

    cols = ...
    cols.append('')
    records = pandas.read_csv('filename.txt', skiprows=1, names=cols)

Then the columns and the header are aligned again.
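For a self-contained version of the same idea, here is a small sketch. The sample data, the header names, and the dummy column name `_trailing` are all assumptions made for the example; a named dummy is used instead of an empty string only so that it is easy to drop afterwards:

```python
import io

import pandas as pd

# Made-up sample: three header names, four fields per data row because
# of the trailing comma.
raw = "a,b,c\n1,2,3,\n4,5,6,\n"

cols = ["a", "b", "c"]    # hypothetical real header names
cols.append("_trailing")  # dummy name for the empty trailing column

# Skip the file's own header line and supply the padded name list.
records = pd.read_csv(io.StringIO(raw), skiprows=1, names=cols)

# The dummy column holds only missing values, so it can be dropped.
records = records.drop(columns="_trailing")

print(records)
```

This keeps the default integer index and leaves the real columns under their own headers.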

+3
