How do I free the memory used by a pandas DataFrame?

I have a really large CSV file that I opened in pandas as follows:

    import pandas
    df = pandas.read_csv('large_txt_file.txt')

As soon as I do this, memory usage grows by 2 GB, which is expected because the file contains millions of rows. My problem comes when I need to release this memory. I ran:

 del df 

However, the memory usage did not decrease. Is this the wrong approach to releasing the memory used by a pandas data frame? If so, what is the proper way?


Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, the memory becomes available to new Python objects, but it is not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

    >>> import os, psutil, numpy as np
    >>> def usage():
    ...     process = psutil.Process(os.getpid())
    ...     return process.memory_info().rss / float(2 ** 20)  # resident memory in MiB
    ...
    >>> usage()  # initial memory usage
    27.5
    >>> arr = np.arange(10 ** 8)  # create a large array without boxing
    >>> usage()
    790.46875
    >>> del arr
    >>> usage()
    27.52734375  # numpy just free()'d the array
    >>> arr = np.arange(10 ** 8, dtype='O')  # create lots of objects
    >>> usage()
    3135.109375
    >>> del arr
    >>> usage()
    2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the number of data frames

Python keeps our memory at a high watermark, but we can reduce the total number of data frames we create. When modifying your data frame, prefer inplace=True so you don't create copies.
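A minimal sketch of the difference, using dropna (one of the many pandas methods that accept inplace; the data here is made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'foo': [1.0, np.nan, 3.0]})

    # Without inplace, a second dataframe exists while df is still referenced:
    cleaned = df.dropna()

    # With inplace=True, df itself is modified and no extra name keeps a copy alive:
    df.dropna(inplace=True)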

Another common problem is holding on to copies of previously created data frames in IPython:

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

    In [3]: df + 1
    Out[3]:
       foo
    0    2
    1    3
    2    4
    3    5

    In [4]: df + 2
    Out[4]:
       foo
    0    3
    1    4
    2    5
    3    6

    In [5]: Out  # Still has all our temporary DataFrame objects!
    Out[5]:
    {3:    foo
     0    2
     1    3
     2    4
     3    5,
     4:    foo
     0    3
     1    4
     2    5
     3    6}

You can fix this by typing %reset Out to clear your output history. Alternatively, you can adjust how much history IPython keeps with ipython --cache-size=5 (the default is 1000).

Reduce your data frame size

Wherever possible, avoid using object dtypes.

    >>> df.dtypes
    foo    float64  # 8 bytes per value
    bar      int64  # 8 bytes per value
    baz     object  # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your data frame. This includes strings.

While numpy supports fixed-size strings in arrays, pandas does not (this has caused user confusion). The difference can be significant:

    >>> import numpy as np
    >>> arr = np.array(['foo', 'bar', 'baz'])
    >>> arr.dtype
    dtype('S3')
    >>> arr.nbytes
    9
    >>> import sys; import pandas as pd
    >>> s = pd.Series(['foo', 'bar', 'baz'])
    >>> s.dtype
    dtype('O')
    >>> sum(sys.getsizeof(x) for x in s)
    120

You may want to avoid string columns, or find a way to represent string data as numbers; one option is sketched below.
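A minimal sketch of one way to do that, assuming a hypothetical column of heavily repeated string labels; pandas' category dtype stores each distinct string once and keeps only small integer codes per row:

    import pandas as pd

    # Hypothetical column with many repeated string values
    s = pd.Series(['red', 'green', 'blue'] * 1_000_000)
    print(s.memory_usage(deep=True))    # every row holds a full Python str

    cat = s.astype('category')          # distinct strings stored once, rows hold int codes
    print(cat.memory_usage(deep=True))  # typically far smaller

    # Alternatively, keep only integer codes plus a lookup table:
    codes, labels = pd.factorize(s)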

If you have a data frame that contains many repeated values (NaN is very common), you can use a sparse data structure to reduce memory usage:

    >>> df1.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 1 columns):
    foo    float64
    dtypes: float64(1)
    memory usage: 605.5 MB

    >>> df1.shape
    (39681584, 1)

    >>> df1.foo.isnull().sum() * 100. / len(df1)
    20.628483479893344  # so 20% of values are NaN

    >>> df1.to_sparse().info()
    <class 'pandas.sparse.frame.SparseDataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 1 columns):
    foo    float64
    dtypes: float64(1)
    memory usage: 543.0 MB
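Note that to_sparse() was deprecated and later removed from pandas; in recent versions the equivalent is to convert to a sparse dtype. A minimal sketch, assuming a float column where NaN is the fill value:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'foo': [1.0, np.nan, np.nan, 2.0]})

    # Only the non-NaN values and their positions are stored
    sparse = df1.astype(pd.SparseDtype('float64', np.nan))
    print(sparse.memory_usage(deep=True))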

View memory usage

You can view the memory usage (docs):

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 14 columns):
    ...
    dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
    memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including the objects.
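If you want a per-column breakdown rather than a single total, DataFrame.memory_usage works as well; a short sketch with a made-up data frame:

    import pandas as pd

    df = pd.DataFrame({'foo': [1.5, 2.5], 'baz': ['a', 'b']})

    # Bytes per column; deep=True also counts the Python objects behind object columns
    print(df.memory_usage(deep=True))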


As noted in the comments, there are some things you can try: gc.collect() (@EdChum) may clear things up, for example. At least in my experience, these things sometimes work and often don't.

There is one thing that always works, however, because it happens at the OS level, not the language level.

Suppose you have a function that creates a huge intermediate DataFrame and returns a smaller result (which might also be a DataFrame):

    def huge_intermediate_calc(something):
        ...
        huge_df = pd.DataFrame(...)
        ...
        return some_aggregate

Then if you do something like

    import multiprocessing

    result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

The function is then executed in a different process. When that process completes, the OS reclaims all the resources it used. There is really nothing Python, pandas, or the garbage collector can do to stop that.
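A self-contained sketch of the pattern; the calculation, column name, and row count are made up for illustration:

    import multiprocessing

    import numpy as np
    import pandas as pd


    def huge_intermediate_calc(n_rows):
        # The big intermediate data frame lives only inside the worker process
        huge_df = pd.DataFrame({'foo': np.random.randn(n_rows)})
        return huge_df['foo'].mean()


    if __name__ == '__main__':
        with multiprocessing.Pool(1) as pool:
            result = pool.map(huge_intermediate_calc, [10 ** 7])[0]
        # The worker has exited by this point, so the OS has reclaimed its memory
        print(result)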


This solves the problem of freeing memory for me!

    import gc

    del [[df_1, df_2]]
    gc.collect()
    df_1 = pd.DataFrame()
    df_2 = pd.DataFrame()

The data frames are then explicitly set to empty ones, so the original data is no longer referenced.


df will not actually be deleted by del df if there are other references to df at the time of deletion. So you need to delete all references to it, for example with del df, to free the memory.

In other words, every reference bound to df must be removed before garbage collection is triggered.

Use objgraph to check what is holding on to the objects.
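A minimal sketch of how that might look; objgraph is a third-party package, the file name here is made up, and rendering the graph also requires graphviz to be installed:

    import gc

    import objgraph   # pip install objgraph
    import pandas as pd

    df = pd.DataFrame({'foo': [1, 2, 3]})
    kept = {'still_here': df}   # a second reference that keeps the data frame alive

    del df
    gc.collect()

    # Draw a graph of whatever still refers to the remaining DataFrame objects
    objgraph.show_backrefs(objgraph.by_type('DataFrame'), filename='df_refs.png')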


I'm not sure, but you can set df to an empty data frame, so that the size of df is reduced:

    import sys
    import pandas as pd

    df = pd.DataFrame()
    print("Size of dataframe", sys.getsizeof(df))

Please correct me if I am wrong

