Improve row add performance on Pandas DataFrames

I have a basic script that walks through a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:

data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...} 

In total, there are several million records. The script itself looks like this:

 city = ["SomeCity"] df = DataFrame({}, columns=['Date', 'HouseID', 'Price']) for city in cities: for dateRun in data[city]: for record in data[city][dateRun]: recSeries = Series([record['Timestamp'], record['Id'], record['Price']], index = ['Date', 'HouseID', 'Price']) FredDF = FredDF.append(recSeries, ignore_index=True) 

This runs very slowly, however. Before looking for a way to parallelize it, I want to make sure I'm not missing something obvious that would make it faster as-is, since I'm still quite new to Pandas.

2 answers

Following BrenBarn's suggestion, I simply restructured the original dictionary into a new dictionary formatted the way from_dict expects. Reorganizing the dictionary was very fast, and after that it was just a matter of calling from_dict with the new dictionary.

The whole run, from loading the data to writing it back out, went from the original hour or so to about 12 seconds. Much better!
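The answer doesn't include the restructuring code itself, so here is a minimal sketch of the pattern it describes, under my own assumptions: the flatten helper and the tiny sample data are invented, but a dict of column lists is one of the layouts DataFrame.from_dict accepts directly.

    import pandas as pd

    # Tiny stand-in for the question's nested structure (values are made up).
    data = {"SomeCity": {"Date1": [{"Timestamp": "2014-01-01", "Id": 1, "Price": 100},
                                   {"Timestamp": "2014-01-01", "Id": 2, "Price": 200}]}}

    def flatten(data, cities):
        # Hypothetical helper: gather every record into per-column lists,
        # the dict-of-lists layout that from_dict can consume in one call.
        cols = {'Date': [], 'HouseID': [], 'Price': []}
        for city in cities:
            for dateRun in data[city]:
                for record in data[city][dateRun]:
                    cols['Date'].append(record['Timestamp'])
                    cols['HouseID'].append(record['Id'])
                    cols['Price'].append(record['Price'])
        return cols

    # One from_dict call replaces millions of per-row appends.
    df = pd.DataFrame.from_dict(flatten(data, ["SomeCity"]))

Building each column once is what makes this fast: append copies the entire frame on every call, while from_dict allocates each column in a single pass.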


I ran into a similar problem where I had to append to a DataFrame many times but didn't know the values ahead of time. I wrote a lightweight DataFrame-like data structure that is just blists under the hood. I use it to accumulate all the data and then, once it's complete, convert the output to a Pandas DataFrame. Here is a link to my project; it's all open source, so I hope it helps others:

https://pypi.python.org/pypi/raccoon
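The answer doesn't show raccoon's actual classes, so here is a rough sketch of the same accumulate-then-convert idea using a plain Python list as the accumulator (raccoon's own API may differ):

    import pandas as pd

    # Accumulate rows cheaply in a plain list (a stand-in for raccoon's
    # blist-backed structure), then convert once at the end.
    rows = []
    for i in range(1000):  # stand-in for the real stream of incoming records
        rows.append({'Date': '2014-01-01', 'HouseID': i, 'Price': 100 + i})

    # A single conversion builds every column of the DataFrame in one pass.
    df = pd.DataFrame(rows, columns=['Date', 'HouseID', 'Price'])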


Source: https://habr.com/ru/post/1211025/

