Pandas groupby with sum() on a large CSV file?

I have a large file (19 GB or so) that I want to load into memory to perform aggregation on some columns.

The file is as follows:

id, col1, col2, col3
1 , 12 , 15 , 13
2 , 18 , 15 , 13
3 , 14 , 15 , 13
3 , 14 , 185 , 213

Note that I group by the (id, col1) columns after loading the data into the data frame. Also note that these keys can be repeated in several consecutive rows, for example:

3 , 14 , 15 , 13
3 , 14 , 185 , 213

For a small file, the following script does the job:

import pandas as pd

data = pd.read_csv("data_file", delimiter=",")
data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()

However, for a large file, I need to use chunksize when reading the csv file to limit the number of lines loaded into memory:

import pandas as pd

data = pd.read_csv("data_file", delimiter=",", chunksize=1000000)
data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()

In the latter case, there is a problem if rows with the same (id, col1) key end up in different chunks. How can I handle this?

EDIT

As pointed out by @EdChum, there is a potential workaround: append the groupby results to a new csv, read that back, and aggregate again, repeating until the data frame size stops changing.

This, however, does not handle a worst-case scenario, namely:

when all chunks (or enough of them that memory cannot hold the results) share the same (id, col1) keys at the end, so no reduction happens. This will cause the process to raise a MemoryError.
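
For reference, one common way to sidestep the chunk-boundary issue is to aggregate each chunk separately and then re-aggregate the partial results. The sketch below is only an illustration of that idea (file name and chunk size are taken from the question); it still assumes the combined partial sums fit in memory, which is exactly the worst case described above.

import pandas as pd

# Aggregate each chunk on its own, keeping only the per-chunk sums.
partials = []
for chunk in pd.read_csv("data_file", delimiter=",", chunksize=1000000):
    partials.append(chunk.groupby(["id", "col1"], as_index=False).sum())

# Re-aggregate the partial sums: summing per-chunk sums gives the overall sums,
# even when rows with the same (id, col1) key were split across chunks.
# Note: the concatenated partials must still fit in memory.
data = pd.concat(partials, ignore_index=True).groupby(["id", "col1"], as_index=False).sum()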

2 answers

First, you can build the list of unique keys by reading the csv with usecols=['id', 'col1']. Then read the csv in chunks, concatenate the subset of rows for each id, and groupby. The code below explains it better.

If it is better to key on the col1 column, change it to constants = df['col1'].unique().tolist(). It depends on your data.

Or you can read only one column, df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id']); again, it depends on your data.

import pandas as pd
import numpy as np
import io

# test data
temp = u"""id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""

df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id', 'col1'])
# drop duplicates; from the output you can choose the constants
df = df.drop_duplicates()
print(df)
#   id  col1
#0   1    13
#2   1    12
#3   2    18
#9   3    14

# for example, a hand-picked list of constants
constants = [1,2,3]
# or the unique values of column id as a list
constants = df['id'].unique().tolist()
print(constants)
#[1, 2, 3]

for i in constants:
    iter_csv = pd.read_csv(io.StringIO(temp), delimiter=",", chunksize=10)
    # concat the subsets of rows where id == constant
    df = pd.concat([chunk[chunk['id'] == i] for chunk in iter_csv])
    # your groupby function
    data = df.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
    print(data.to_csv(index=False))
#id,col1,col2,col3
#1,12,15,13
#1,13,30,28
#
#id,col1,col2,col3
#2,18,90,78
#
#id,col1,col2,col3
#3,14,215,239

dask solution

Dask.dataframe can do this almost without modification:

$ cat so.csv
id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213

$ pip install dask[dataframe]
$ ipython

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_csv('so.csv', sep=',')

In [3]: df.head()
Out[3]:
   id  col1  col2  col3
0   1    13    15    14
1   1    13    15    14
2   1    12    15    13
3   2    18    15    13
4   2    18    15    13

In [4]: df.groupby(['id', 'col1']).sum().compute()
Out[4]:
         col2  col3
id col1
1  12      15    13
   13      30    28
2  18      90    78
3  14     215   239

No one has implemented as_index=False for groupby yet. We can work around this with assign.

In [5]: df.assign(id_2=df.id, col1_2=df.col1).groupby(['id_2', 'col1_2']).sum().compute()
Out[5]:
             id  col1  col2  col3
id_2 col1_2
1    12       1    12    15    13
     13       2    26    30    28
2    18      12   108    90    78
3    14       9    42   215   239

How it works

We pull out chunks and do groupbys on them just as in your first example. Once the grouping and summing of each chunk is done, all the intermediate results are gathered together and another, slightly different groupby.sum is performed. This assumes that the intermediate results fit in memory.
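
If even the intermediate, already-grouped result is too large for memory, reasonably recent dask versions (an assumption about the version you have) let you keep the aggregated output split across several partitions with the split_out argument and write it out partition by partition rather than computing it into one pandas frame. A minimal sketch:

# Sketch, assuming a dask version whose groupby aggregations accept split_out.
result = df.groupby(['id', 'col1']).sum(split_out=8)  # keep the output in 8 partitions
result.to_csv('aggregated-*.csv')                     # write partition-by-partition instead of compute()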

Parallelism

As a pleasant side effect, it will also work in parallel.
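
For example, in recent dask versions (again an assumption about the installed version) you can pick the local scheduler explicitly when computing:

# Threads are the default for dask.dataframe; processes can help when the work is CPU-bound.
result = df.groupby(['id', 'col1']).sum()
result.compute(scheduler='threads')     # default local threaded scheduler
result.compute(scheduler='processes')   # local multiprocessing scheduler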

