Your DataFrame is simply too big. The memory load doubles when this line is executed:
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]
This is because you are assigning a new value to A, so memory is allocated for the new DataFrame while the old one is still alive during the filtering. The new one is almost the same size as the old one, because you keep almost all of the data points. That alone requires enough memory for two copies of A, and that is before counting the additional bookkeeping memory used by the loc implementation.
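Here is a minimal sketch of that effect on a smaller, made-up frame (the sizes, one_center, and exclusion values mirror your example but are otherwise arbitrary):

    import numpy as np
    import pandas as pd

    # A smaller stand-in for your DataFrame, just to illustrate the copy.
    df = pd.DataFrame({'score': np.random.random(1_000_000),
                       'coordinate': np.arange(1, 1_000_001)})
    print('original: %.1f MiB' % (df.memory_usage().sum() / 2**20))

    one_center, exclusion = 500_000, 147
    filtered = df.loc[(df.coordinate <= one_center - exclusion) |
                      (df.coordinate >= one_center + exclusion)]

    # `filtered` is a brand-new DataFrame of nearly the same size;
    # until the old object is rebound and collected, both copies coexist.
    print(filtered is df)  # False
    print('filtered: %.1f MiB' % (filtered.memory_usage().sum() / 2**20))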
Apparently loc causes pandas to allocate enough memory for an extra copy of the data. I do not know why; I suspect it is a performance optimization. This means that at peak you end up consuming four times the size of the DataFrame. As soon as loc finishes and the unreferenced memory is garbage-collected (which you can force by calling gc.collect()), the memory load drops back to twice the size of the DataFrame. The next call to loc doubles everything again, returning you to fourfold usage; collect the garbage again and you are back to double. This pattern repeats for as long as the loop runs.
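If you want numbers rather than eyeballing top, a sketch along these lines should expose the 2x/4x pattern. It assumes the third-party psutil package, and rss_mb is just a helper name I made up:

    import gc
    import os

    import psutil  # third-party: pip install psutil

    proc = psutil.Process(os.getpid())

    def rss_mb():
        # Resident set size of this process, in MiB.
        return proc.memory_info().rss / 2**20

    # ...inside the loop, around the filtering step:
    print('before loc:     %.0f MiB' % rss_mb())
    # A = A.loc[...]  # usage should roughly quadruple during this call
    print('after loc:      %.0f MiB' % rss_mb())
    gc.collect()
    print('after collect:  %.0f MiB' % rss_mb())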
To watch this happen, run this modified version of your code:
    import numpy as np
    import pandas as pd
    import gc

    A = pd.DataFrame({'score': np.random.random(32346018),
                      'coordinate': np.arange(1, 32346019)})
    peak_centers = []
    peak_scores = []
    exclusion = 147
    count = 0
    while A.shape[0] > 0:
        gc.collect()
        count += 1
        print('iteration %d: %d rows remaining' % (count, A.shape[0]))
        input()  # pause here so you can read off the memory usage
        # peak-picking body, reconstructed around the loc line discussed above
        peak_idx = A.score.idxmax()
        one_center = A.coordinate[peak_idx]
        peak_centers.append(one_center)
        peak_scores.append(A.score[peak_idx])
        A = A.loc[(A.coordinate <= one_center - exclusion) |
                  (A.coordinate >= one_center + exclusion)]
Press the Enter key between iterations and monitor memory usage with top or a similar tool.
At the beginning of the first iteration, you will see memory usage of x percent. On the second iteration, after loc has been called for the first time, memory usage doubles to 2x. On each subsequent call to loc it peaks at 4x, then drops back to 2x after garbage collection.