How do I free the memory used by a pandas DataFrame?

I have a really large CSV file that I opened in pandas as follows:

    import pandas
    df = pandas.read_csv('large_txt_file.txt')

As soon as I do this, memory usage grows by 2 GB, which is expected because the file contains millions of rows. My problem comes when I need to release this memory. I ran:

 del df 

However, the memory usage did not decrease. Is this the wrong approach to releasing the memory used by a pandas data frame? If so, what is the proper way?


Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, the memory becomes available to new Python objects, but it is not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

    >>> import os, psutil, numpy as np
    >>> def usage():
    ...     process = psutil.Process(os.getpid())
    ...     return process.memory_info().rss / float(2 ** 20)  # resident memory in MiB
    ...
    >>> usage()  # initial memory usage
    27.5
    >>> arr = np.arange(10 ** 8)  # create a large array without boxing
    >>> usage()
    790.46875
    >>> del arr
    >>> usage()
    27.52734375  # numpy just free()'d the array
    >>> arr = np.arange(10 ** 8, dtype='O')  # create lots of objects
    >>> usage()
    3135.109375
    >>> del arr
    >>> usage()
    2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the number of data frames

Python keeps our memory at a high watermark, but we can reduce the total number of data frames we create. When modifying your data frame, prefer inplace=True so you don't create copies.
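A minimal sketch of the difference, using dropna (one of the many pandas methods that accept inplace; the data here is made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'foo': [1.0, np.nan, 3.0]})

    # Without inplace, a second dataframe exists while df is still referenced:
    cleaned = df.dropna()

    # With inplace=True, df itself is modified and no extra name keeps a copy alive:
    df.dropna(inplace=True)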

Another common problem is holding on to copies of previously created data frames in IPython:

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

    In [3]: df + 1
    Out[3]:
       foo
    0    2
    1    3
    2    4
    3    5

    In [4]: df + 2
    Out[4]:
       foo
    0    3
    1    4
    2    5
    3    6

    In [5]: Out  # Still has all our temporary DataFrame objects!
    Out[5]:
    {3:    foo
     0    2
     1    3
     2    4
     3    5,
     4:    foo
     0    3
     1    4
     2    5
     3    6}

You can fix this by typing %reset Out to clear your output history. Alternatively, you can adjust how much history IPython keeps with ipython --cache-size=5 (the default is 1000).

Reduce your data frame size

Wherever possible, avoid using object dtypes.

    >>> df.dtypes
    foo    float64  # 8 bytes per value
    bar      int64  # 8 bytes per value
    baz     object  # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your data frame. This includes strings.

While numpy supports fixed-size strings in arrays, pandas does not (this has caused user confusion). The difference can be significant:

    >>> import numpy as np
    >>> arr = np.array(['foo', 'bar', 'baz'])
    >>> arr.dtype
    dtype('S3')
    >>> arr.nbytes
    9
    >>> import sys; import pandas as pd
    >>> s = pd.Series(['foo', 'bar', 'baz'])
    >>> s.dtype
    dtype('O')
    >>> sum(sys.getsizeof(x) for x in s)
    120

You may want to avoid string columns, or find a way to represent string data as numbers; one option is sketched below.
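A minimal sketch of one way to do that, assuming a hypothetical column of heavily repeated string labels; pandas' category dtype stores each distinct string once and keeps only small integer codes per row:

    import pandas as pd

    # Hypothetical column with many repeated string values
    s = pd.Series(['red', 'green', 'blue'] * 1_000_000)
    print(s.memory_usage(deep=True))    # every row holds a full Python str

    cat = s.astype('category')          # distinct strings stored once, rows hold int codes
    print(cat.memory_usage(deep=True))  # typically far smaller

    # Alternatively, keep only integer codes plus a lookup table:
    codes, labels = pd.factorize(s)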

If you have a data frame that contains many repeated values (NaN is very common), you can use a sparse data structure to reduce memory usage:

    >>> df1.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 1 columns):
    foo    float64
    dtypes: float64(1)
    memory usage: 605.5 MB

    >>> df1.shape
    (39681584, 1)

    >>> df1.foo.isnull().sum() * 100. / len(df1)
    20.628483479893344  # so 20% of values are NaN

    >>> df1.to_sparse().info()
    <class 'pandas.sparse.frame.SparseDataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 1 columns):
    foo    float64
    dtypes: float64(1)
    memory usage: 543.0 MB
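Note that to_sparse() was deprecated and later removed from pandas; in recent versions the equivalent is to convert to a sparse dtype. A minimal sketch, assuming a float column where NaN is the fill value:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'foo': [1.0, np.nan, np.nan, 2.0]})

    # Only the non-NaN values and their positions are stored
    sparse = df1.astype(pd.SparseDtype('float64', np.nan))
    print(sparse.memory_usage(deep=True))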

View memory usage

You can view the memory usage (docs):

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 14 columns):
    ...
    dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
    memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including the objects.
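If you want a per-column breakdown rather than a single total, DataFrame.memory_usage works as well; a short sketch with a made-up data frame:

    import pandas as pd

    df = pd.DataFrame({'foo': [1.5, 2.5], 'baz': ['a', 'b']})

    # Bytes per column; deep=True also counts the Python objects behind object columns
    print(df.memory_usage(deep=True))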


As noted in the comments, there are some things you can try: gc.collect() (@EdChum) may clear things up, for example. At least in my experience, these things sometimes work and often don't.

There is one thing that always works, however, because it happens at the OS level, not the language level.

Suppose you have a function that creates a huge intermediate DataFrame and returns a smaller result (which might also be a DataFrame):

    def huge_intermediate_calc(something):
        ...
        huge_df = pd.DataFrame(...)
        ...
        return some_aggregate

Then if you do something like

    import multiprocessing

    result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

The function is then executed in a different process. When that process completes, the OS reclaims all the resources it used. There is really nothing Python, pandas, or the garbage collector can do to stop that.
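A self-contained sketch of the pattern; the calculation, column name, and row count are made up for illustration:

    import multiprocessing

    import numpy as np
    import pandas as pd


    def huge_intermediate_calc(n_rows):
        # The big intermediate data frame lives only inside the worker process
        huge_df = pd.DataFrame({'foo': np.random.randn(n_rows)})
        return huge_df['foo'].mean()


    if __name__ == '__main__':
        with multiprocessing.Pool(1) as pool:
            result = pool.map(huge_intermediate_calc, [10 ** 7])[0]
        # The worker has exited by this point, so the OS has reclaimed its memory
        print(result)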


This solves the problem of freeing memory for me!

    import gc

    del [[df_1, df_2]]
    gc.collect()
    df_1 = pd.DataFrame()
    df_2 = pd.DataFrame()

The data frames are then explicitly set to empty ones, so the original data is no longer referenced.


df will not actually be deleted by del df if there are other references to df at the time of deletion. So you need to delete all references to it, for example with del df, to free the memory.

In other words, every reference bound to df must be removed before garbage collection is triggered.

Use objgraph to check what is holding on to the objects.
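A minimal sketch of how that might look; objgraph is a third-party package, the file name here is made up, and rendering the graph also requires graphviz to be installed:

    import gc

    import objgraph   # pip install objgraph
    import pandas as pd

    df = pd.DataFrame({'foo': [1, 2, 3]})
    kept = {'still_here': df}   # a second reference that keeps the data frame alive

    del df
    gc.collect()

    # Draw a graph of whatever still refers to the remaining DataFrame objects
    objgraph.show_backrefs(objgraph.by_type('DataFrame'), filename='df_refs.png')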


I'm not sure, but you can set df to an empty data frame, so that the size of df is reduced:

    import sys
    import pandas as pd

    df = pd.DataFrame()
    print("Size of dataframe", sys.getsizeof(df))

Please correct me if I am wrong

