Pandas & Scikit: memory usage when slicing a DataFrame

I have a large DataFrame loaded from a csv file (about 300 MB).

From this, I extract several dozen features for use in a RandomForestClassifier: some of the features are derived directly from columns in the data, for example:

  feature1 = data["SomeColumn"].apply(len)
  feature2 = data["AnotherColumn"]

Others are created as new DataFrames from numpy arrays, using the index of the original DataFrame:

 feature3 = pandas.DataFrame(count_array, index=data.index) 

All these features are then joined into a single DataFrame:

 features = feature1.join(feature2) # etc... 

And I train a random forest classifier:

 classifier = RandomForestClassifier(
     n_estimators=100,
     max_features=None,
     verbose=2,
     compute_importances=True,
     n_jobs=n_jobs,
     random_state=0,
 )
 classifier.fit(features, data["TargetColumn"])

The RandomForestClassifier works fine with these features: building the forest takes O(hundreds of megabytes) of memory. However: if, after loading my data, I take a small subset of it:

 data_slice = data[data['somecolumn'] > value] 

then building the random forest suddenly takes many gigabytes of memory, even though the feature DataFrame is now only O(10%) of the size of the original.

I could believe that this is because a sliced view of the data doesn't permit further slices to be taken efficiently (though I don't see how that would propagate into the features array), so I've tried:

 data = pandas.DataFrame(data_slice, copy=True) 

but it doesn’t help.
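One diagnostic that might narrow this down (a minimal sketch, assuming the combined feature frame is still called features): check whether the slice left any column with dtype object, since an object-backed frame forces scikit-learn to convert and copy the whole array, which can cost far more memory than the DataFrame's nominal size suggests.

 # Sketch: inspect the dtypes of the combined feature frame.
 print(features.dtypes)
 print(features.values.dtype)  # ideally a numeric dtype such as float64, not object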

  • Why does using a subset of data greatly increase memory usage?
  • Is there some way of copying / rearranging the DataFrame that might make things efficient again (e.g. something like the sketch below)?
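By "rearranging" I mean something along these lines (a sketch only; the names X and y and the float64 conversion are my additions, not part of the pipeline above):

 import numpy as np

 # Build a fresh, contiguous float array from the sliced features and pass
 # that to fit(), so scikit-learn does not have to make its own conversion
 # copy of a non-contiguous or object-backed frame.
 X = np.ascontiguousarray(features.values, dtype=np.float64)
 y = data_slice["TargetColumn"].values

 classifier.fit(X, y)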
python pandas scikit-learn
1 answer

RandomForestClassifier makes several copies of the dataset in memory, especially when n_jobs is large. We are aware of these issues and it is a priority to fix them:

  • I am currently working on a subclass of the standard library's multiprocessing.Pool class that will do no memory copy when numpy.memmap instances are passed to the subprocess workers (a rough illustration of the memmap idea follows this list). This will make it possible to share the memory of the source dataset, plus some precomputed data structures, between the workers. Once this is done I will close the corresponding issue on the GitHub tracker.

  • There is an ongoing refactoring that will further decrease the memory usage of RandomForestClassifier by a factor of two. However, the current state of the refactoring is twice as slow as master, so further work is still required.
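To illustrate the general memmap idea mentioned above (this is not the in-progress fix itself, just a sketch using joblib with an arbitrary filename):

 import numpy as np
 from joblib import dump, load

 # Write the training array to disk once, then map it read-only: every process
 # that loads it with mmap_mode="r" gets a numpy.memmap view backed by the same
 # file instead of receiving its own in-memory copy.
 X = np.asarray(features, dtype=np.float64)  # `features` as in the question
 dump(X, "features.joblib")
 X_shared = load("features.joblib", mmap_mode="r")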

However, none of these fixes will make it into the 0.12 release, which is scheduled for next week. Most probably they will land in 0.13 (planned for release in 3 to 4 months), but of course they will be available in the master branch much sooner.

