Memory usage in while loop

My code includes this while loop:

    while A.shape[0] > 0:
        idx = A.score.values.argmax()
        one_center = A.coordinate.iloc[idx]
        # peak_centers and peak_scores are python lists
        peak_centers.append(one_center)
        peak_scores.append(A.score.iloc[idx])
        # exclude the coordinates around the selected peak
        A = A.loc[(A.coordinate <= one_center - exclusion) |
                  (A.coordinate >= one_center + exclusion)]

A is a pandas DataFrame that looks like this:

       score  coordinate
    0  0.158           1
    1  0.167           2
    2  0.175           3
    3  0.183           4
    4  0.190           5

I try to find the maximum score (the peak) in A, then exclude a range of coordinates around the previously found peak, then find the next peak, and so on.

A here is a very big pandas DataFrame. Before starting this while loop, the ipython session used about 20% of the machine's memory. I thought that executing the loop would reduce memory consumption, since I exclude some data from the DataFrame on each iteration. However, I observe that memory usage keeps growing, and at some point the machine runs out of memory.

Did I miss something? Do I need to explicitly free memory?

Here is a short script that replicates the behaviour with random data:

    import numpy as np
    import pandas as pd

    A = pd.DataFrame({'score': np.random.random(132346018),
                      'coordinate': np.arange(1, 132346019)})
    peak_centers = []
    peak_scores = []
    exclusion = 147

    while A.shape[0] > 0:
        idx = A.score.values.argmax()
        one_center = A.coordinate.iloc[idx]
        # peak_centers and peak_scores are python lists
        peak_centers.append(one_center)
        peak_scores.append(A.score.iloc[idx])
        # exclude the coordinates around the selected peak
        A = A.loc[(A.coordinate <= one_center - exclusion) |
                  (A.coordinate >= one_center + exclusion)]

    # terminated the loop after memory consumption gets to 90% of machine memory
    # but peak_centers and peak_scores are still short lists
    print len(peak_centers)  # output is 16
2 answers

Your DataFrame is simply too big. The memory load doubles while this line executes:

    A = A.loc[(A.coordinate <= one_center - exclusion) |
              (A.coordinate >= one_center + exclusion)]

This is because you assign a new value to A: memory is allocated for the new DataFrame while the old one is still alive for the duration of the filtering. The new frame is almost the same size as the old one, because you keep almost all of the data points. That alone accounts for enough memory to hold two copies of A, and that is without counting the extra bookkeeping memory used by the loc implementation.

Apparently loc causes pandas to allocate enough memory for yet another copy of the data. I do not know why; I assume it is some kind of performance optimization. This means that, at peak, you end up consuming four times the size of the DataFrame. Once loc has finished and the unreferenced memory is freed (which you can force by calling gc.collect()), the memory load drops back to twice the size of the DataFrame. The next call to loc doubles everything again, and you are back to fourfold usage. Garbage-collect again and you are back to double. This goes on for as long as the loop runs.

To check what happens, run this modified version of your code:

    import numpy as np
    import pandas as pd
    import gc

    A = pd.DataFrame({'score': np.random.random(32346018),
                      'coordinate': np.arange(1, 32346019)})
    peak_centers = []
    peak_scores = []
    exclusion = 147
    count = 0

    while A.shape[0] > 0:
        gc.collect()  # Force garbage collection.
        count += 1    # Increment the iteration count.
        print('iteration %d, shape %s' % (count, A.shape))
        raw_input()   # Wait for the user to press Enter.
        idx = A.score.values.argmax()
        one_center = A.coordinate.iloc[idx]
        # peak_centers and peak_scores are python lists
        peak_centers.append(one_center)
        peak_scores.append(A.score.iloc[idx])
        print(len(peak_centers), len(peak_scores))
        # exclude the coordinates around the selected peak
        A = A.loc[(A.coordinate <= one_center - exclusion) |
                  (A.coordinate >= one_center + exclusion)]

Press the Enter key between iterations and monitor memory usage with top or a similar tool.

At the beginning of the first iteration you will see a memory usage of x percent. On the second iteration, after loc has been called for the first time, memory usage doubles to 2x. On subsequent iterations you will see it climb to 4x during each call to loc and then drop back to 2x after the garbage collection.
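
If you would rather log this from inside the script than watch top, a small helper along these lines works. This is just a sketch; it assumes the third-party psutil package is installed, and gc is already imported as in the script above:

    import os
    import psutil  # third-party; pip install psutil

    _process = psutil.Process(os.getpid())

    def report_memory(label):
        # Resident set size of the current Python process, in gigabytes.
        print('%s: %.2f GB resident' % (label, _process.memory_info().rss / 1e9))

    # Inside the loop, wrap the loc line like this:
    #     report_memory('before loc')
    #     A = A.loc[(A.coordinate <= one_center - exclusion) |
    #               (A.coordinate >= one_center + exclusion)]
    #     report_memory('after loc')
    #     gc.collect()
    #     report_memory('after gc.collect()')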


Use DataFrame.drop with inplace=True if you want destructive mutation of A without copying a large subset of A's data:

    # Mark the rows whose coordinate falls inside the exclusion window
    # around the selected peak; these are the rows to remove
    # (between() is inclusive on both ends).
    places_to_drop = (A.coordinate - one_center).between(-exclusion, exclusion)
    A.drop(A.index[np.where(places_to_drop)[0]], inplace=True)
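
Put together with the loop from the question, the whole thing might look like this. This is only a sketch, using the variable names from the question and a much smaller array so it runs quickly:

    import numpy as np
    import pandas as pd

    A = pd.DataFrame({'score': np.random.random(100000),
                      'coordinate': np.arange(1, 100001)})
    peak_centers = []
    peak_scores = []
    exclusion = 147

    while A.shape[0] > 0:
        idx = A.score.values.argmax()
        one_center = A.coordinate.iloc[idx]
        peak_centers.append(one_center)
        peak_scores.append(A.score.iloc[idx])
        # Drop, in place, the rows inside the exclusion window around the
        # selected peak instead of building a filtered copy with loc.
        places_to_drop = (A.coordinate - one_center).between(-exclusion, exclusion)
        A.drop(A.index[np.where(places_to_drop)[0]], inplace=True)

    print(len(peak_centers))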

The place where the original use of loc ultimately ends up is in the _getitem_iterable method of _NDFrameIndexer. _LocIndexer is a subclass of _NDFrameIndexer, and an instance of _LocIndexer is created to populate the loc property of the DataFrame.

In particular, _getitem_iterable performs a check for a boolean index, which is what happens in your case. Then a new array of integer locations is built (which is wasteful memory-wise, given that key is already in boolean form):

 inds, = key.nonzero() 

and then, finally, the selected rows are returned as a copy:

 return self.obj.take(inds, axis=axis, convert=False) 

In that code, key is your boolean index (i.e. the result of the expression (A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)) and self.obj is the parent DataFrame instance on which loc was called, so obj here is simply A.
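
As a small numpy-only illustration of that extra allocation (not pandas internals, just what nonzero does to a boolean key):

    import numpy as np

    # A boolean mask analogous to the key produced by the loc expression.
    key = np.array([True, False, True, True, False])

    # nonzero() materialises a brand-new integer array of the selected
    # positions, even though the boolean mask could be used for indexing
    # directly.
    inds, = key.nonzero()
    print(inds)  # [0 2 3]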

The DataFrame.take documentation explains that the default behaviour is to return a copy. In the current implementation of the indexers there is no mechanism for passing a keyword argument through to take that would avoid making that copy.
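
You can confirm the copying behaviour directly; in this small sketch, the result of take does not share memory with the original frame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'score': [0.1, 0.2, 0.3],
                       'coordinate': [1, 2, 3]})

    taken = df.take([0, 2])  # select rows 0 and 2 by position

    # take gathered the rows into freshly allocated storage, so the two
    # frames do not share their underlying buffers.
    print(np.shares_memory(df['score'].values, taken['score'].values))  # False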

On any reasonably modern machine, the drop approach should be smooth sailing for data of the size you describe, so the size of A is not to blame.
