Improve row add performance on Pandas DataFrames

I have a basic script that walks through a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:

data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...} 

In total, there are several million records. The script itself looks like this:

 city = ["SomeCity"] df = DataFrame({}, columns=['Date', 'HouseID', 'Price']) for city in cities: for dateRun in data[city]: for record in data[city][dateRun]: recSeries = Series([record['Timestamp'], record['Id'], record['Price']], index = ['Date', 'HouseID', 'Price']) FredDF = FredDF.append(recSeries, ignore_index=True) 

This runs very slowly, however. Before looking for a way to parallelize it, I want to make sure I'm not missing something obvious that would make it faster as-is, since I'm still quite new to Pandas.

2 answers

Following BrenBarn's suggestion, I simply restructured the original dictionary into a new dictionary formatted the way from_dict expects. Reorganizing the dictionary was very fast, and after that it was just a matter of calling from_dict with the new dictionary.

The whole run, from loading the data to writing it back out, went from the original hour or so to about 12 seconds. Much better!
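The answer doesn't include the restructuring code itself, so here is a minimal sketch of the pattern it describes, under my own assumptions: the flatten helper and the tiny sample data are invented, but a dict of column lists is one of the layouts DataFrame.from_dict accepts directly.

    import pandas as pd

    # Tiny stand-in for the question's nested structure (values are made up).
    data = {"SomeCity": {"Date1": [{"Timestamp": "2014-01-01", "Id": 1, "Price": 100},
                                   {"Timestamp": "2014-01-01", "Id": 2, "Price": 200}]}}

    def flatten(data, cities):
        # Hypothetical helper: gather every record into per-column lists,
        # the dict-of-lists layout that from_dict can consume in one call.
        cols = {'Date': [], 'HouseID': [], 'Price': []}
        for city in cities:
            for dateRun in data[city]:
                for record in data[city][dateRun]:
                    cols['Date'].append(record['Timestamp'])
                    cols['HouseID'].append(record['Id'])
                    cols['Price'].append(record['Price'])
        return cols

    # One from_dict call replaces millions of per-row appends.
    df = pd.DataFrame.from_dict(flatten(data, ["SomeCity"]))

Building each column once is what makes this fast: append copies the entire frame on every call, while from_dict allocates each column in a single pass.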


I ran into a similar problem where I had to append to a DataFrame many times but didn't know the values ahead of time. I wrote a lightweight DataFrame-like data structure that is just blists under the hood. I use it to accumulate all the data and then, once it's complete, convert the output to a Pandas DataFrame. Here is a link to my project; it's all open source, so I hope it helps others:

https://pypi.python.org/pypi/raccoon
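The answer doesn't show raccoon's actual classes, so here is a rough sketch of the same accumulate-then-convert idea using a plain Python list as the accumulator (raccoon's own API may differ):

    import pandas as pd

    # Accumulate rows cheaply in a plain list (a stand-in for raccoon's
    # blist-backed structure), then convert once at the end.
    rows = []
    for i in range(1000):  # stand-in for the real stream of incoming records
        rows.append({'Date': '2014-01-01', 'HouseID': i, 'Price': 100 + i})

    # A single conversion builds every column of the DataFrame in one pass.
    df = pd.DataFrame(rows, columns=['Date', 'HouseID', 'Price'])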


Source: https://habr.com/ru/post/1211025/

