Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, the memory becomes available to new Python objects, but it is not free()'d back to the system (see this question).
If you stick to numeric numpy arrays, those are freed, but boxed objects are not.
>>> import os, psutil, numpy as np
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.memory_info().rss / float(2 ** 20)  # resident set size in MiB
...
>>> usage()
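For example (a rough sketch building on the usage() helper above; exact numbers vary by system), deleting a large numeric numpy array gives the memory back, while a pile of boxed Python objects tends to leave the process at its high watermark:

>>> before = usage()
>>> arr = np.ones(10 ** 7)           # 10 million float64 values, roughly 76 MiB
>>> usage() - before                 # footprint grows by about that much
>>> del arr
>>> usage() - before                 # numeric array memory is returned
>>> objs = [object() for _ in range(10 ** 7)]
>>> del objs
>>> usage() - before                 # boxed-object memory usually is not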
Reducing the number of data frames
Python keeps our memory at a high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True so you are not creating copies.
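For example (a minimal sketch using drop; the same idea applies to fillna, rename and other methods that accept inplace):

>>> import pandas as pd
>>> df = pd.DataFrame({'foo': [1, 2, 3, 4], 'bar': [5, 6, 7, 8]})
>>> df = df.drop('bar', axis=1)            # returns a new dataframe, then rebinds df
>>> df.drop('foo', axis=1, inplace=True)   # modifies df in place instead of returning a copy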
Another common problem is ipython holding references to dataframes you created previously in its output history:
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]:
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]:
   foo
0    3
1    4
2    5
3    6

In [5]: Out
You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).
Reducing dataframe size
Avoid object dtypes whenever possible.
>>> df.dtypes
foo    float64
Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.
While numpy has support for fixed-size strings in arrays, pandas does not (and this has caused user confusion). It can make a significant difference:
>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120
You can avoid using string columns or find a way to represent string data as numbers.
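One way to do that (a sketch, assuming the string values repeat a lot) is pandas' categorical dtype, which stores each distinct string once plus a small integer code per row:

>>> s = pd.Series(['foo', 'bar', 'foo', 'foo', 'bar'] * 1000)
>>> s.memory_usage(deep=True)    # every row holds a boxed Python string
>>> c = s.astype('category')
>>> c.memory_usage(deep=True)    # two distinct strings plus one small code per row
>>> c.cat.codes.head()           # the underlying integer representation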
If you have a dataframe that contains many repeated values (NaN is very common), you can use a sparse data structure to reduce memory usage:
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344
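A rough sketch of the conversion in current pandas (older versions used df1.to_sparse(), which has since been removed). With NaN as the fill value the missing entries are no longer stored explicitly, though each stored value also needs an index entry, so the actual saving depends on how sparse the column really is:

>>> sdf = df1.astype(pd.SparseDtype("float64", np.nan))   # NaNs become implicit
>>> sdf.info()             # compare memory usage against the dense frame above
>>> sdf.sparse.density     # fraction of values that are actually stored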
View memory usage
You can view memory usage (docs):
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB
As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.
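Relatedly (a small sketch on the same df), df.memory_usage(deep=True) gives a per-column breakdown, which makes it easy to see which object columns are responsible:

>>> df.memory_usage(deep=True)                  # bytes per column, counting boxed objects
>>> df.memory_usage(deep=True).sum() / 2 ** 20  # total in MiB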