I have a large file (19 GB or so) that I want to load into memory to perform aggregation on some columns.
The file is as follows:
```
id, col1, col2, col3
1 , 12 , 15 , 13
2 , 18 , 15 , 13
3 , 14 , 15 , 13
3 , 14 , 185 , 213
```
Note that I use the (id, col1) columns for the aggregation after loading the data into a data frame. Also note that these keys can be repeated over several rows, for example:
```
3 , 14 , 15 , 13
3 , 14 , 185 , 213
```
For a small file, the following script does the job:
import pandas as pd data = pd.read_csv("data_file", delimiter=",") data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
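For the sample rows above (ignoring the stray spaces around the delimiters, and with approximate output spacing), this produces one row per (id, col1) pair, with the two (3, 14) rows summed:

```
   id  col1  col2  col3
0   1    12    15    13
1   2    18    15    13
2   3    14   200   226
```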
However, for a large file, I need to use chunksize when reading the csv file to limit the number of lines loaded into memory:
import pandas as pd data = pd.read_csv("data_file", delimiter=",", chunksize=1000000) data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
In the latter case, there is a problem if rows with the same (id, col1) end up in different chunks: they are aggregated separately and remain as separate rows in the result. How can I handle this?
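To make the failure mode concrete, here is a minimal sketch; the in-memory sample and the tiny chunksize of 3 are hypothetical, chosen only so that the two (3, 14) rows land in different chunks:

```python
import pandas as pd
from io import StringIO

# hypothetical in-memory copy of the sample rows above
sample = StringIO(
    "id,col1,col2,col3\n"
    "1,12,15,13\n"
    "2,18,15,13\n"
    "3,14,15,13\n"
    "3,14,185,213\n"
)

# with chunksize=3 the last row falls into a second chunk, so the
# (3, 14) group is summed twice: once as (15, 13) and once as (185, 213)
partial = pd.concat(
    chunk.groupby(["id", "col1"], as_index=False).sum()
    for chunk in pd.read_csv(sample, chunksize=3)
)
print(partial)  # four rows instead of three; (3, 14) appears twice
```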
EDIT
As pointed out by @EdChum, there is a potential workaround: append the groupby result of each chunk to a new CSV, then read that back and aggregate again, repeating until the size of the data frame no longer changes.

If I understand it correctly, a rough sketch of that workaround would look like this (the intermediate file names, the pass limit, and the chunk size are my own placeholders):
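```python
import pandas as pd

src = "data_file"
prev_rows = float("inf")

for i in range(10):                        # safety cap on the number of passes
    dst = "partial_%d.csv" % i             # hypothetical intermediate file
    rows = 0
    with open(dst, "w") as out:
        # aggregate chunk by chunk and append the partial sums to dst
        for n, chunk in enumerate(pd.read_csv(src, delimiter=",", chunksize=1000000)):
            agg = chunk.groupby(["id", "col1"], as_index=False).sum()
            agg.to_csv(out, header=(n == 0), index=False)
            rows += len(agg)
    if rows >= prev_rows:                  # the size no longer shrinks, so stop
        break
    prev_rows, src = rows, dst

# final pass: load the (hopefully much smaller) file and aggregate in memory
result = pd.read_csv(src).groupby(["id", "col1"], as_index=False).sum()
```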
This workaround, however, does not cover the worst case, namely:
when all chunks (or more chunks than memory can handle) contain the same problematic (id, col1) keys. In that case the aggregated intermediate results never shrink enough to fit in memory, and the process ends with a MemoryError.