I have a large DataFrame with approximately 392 million rows and 9 columns, and I want to apply a filter to retrieve a subset.
Here is the filter I apply to my source DataFrame dh_activity_recos:
dh_activity_approved = dh_activity_recos.loc[dh_activity_recos.approved_flag == 1]
Now, when I apply this filter, I get the following memory error:
Traceback (most recent call last):
File "/mnt01/eh-datasci/ravinder/working/final_recos_processing.py", line 144, in <module>
dh_activity_approved = dh_activity_recos.loc[dh_activity_recos.approved_flag == 1]
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1227, in __getitem__
return self._getitem_axis(key, axis=0)
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1344, in _getitem_axis
return self._getbool_axis(key, axis=axis)
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1239, in _getbool_axis
raise self._exception(detail)
KeyError: MemoryError()
I cannot understand the reason. I checked with dir(); there are no other large objects in memory besides this DataFrame. Moreover, I am running this on a cloud instance with 128 GB of RAM, so I am not sure why this error occurs.
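
For reference, this is roughly how I would check the DataFrame's actual memory footprint rather than relying on dir(). This is only a sketch: memory_usage(deep=True) and info(memory_usage="deep") are standard pandas calls, but the deep option needs a reasonably recent pandas version, and I have not included the output here.

# assuming dh_activity_recos is the DataFrame already loaded above
import pandas as pd

# Total bytes actually held by the DataFrame, including the Python
# objects inside object-dtype columns (deep=True walks those as well).
total_bytes = dh_activity_recos.memory_usage(deep=True).sum()
print("DataFrame size: %.2f GB" % (total_bytes / 1024.0 ** 3))

# Per-column dtypes and memory, useful for spotting object columns
# that blow up the footprint.
dh_activity_recos.info(memory_usage="deep")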