I have a large DataFrame loaded from a csv file (about 300 MB).
From this, I extract several dozen features to use with a RandomForestClassifier: some of the features are simply derived from columns in the data, for example:
feature1 = data["SomeColumn"].apply(len)
feature2 = data["AnotherColumn"]
Others are created as new DataFrames from numpy arrays, using the index of the original DataFrame:
feature3 = pandas.DataFrame(count_array, index=data.index)
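Here count_array is just a plain numpy array with one value per row of data, produced by a separate counting step; a minimal, purely hypothetical stand-in would be:

import numpy

# Hypothetical stand-in for count_array: one value per row of `data`
# (the real array comes from a separate counting step).
count_array = numpy.zeros(len(data), dtype=numpy.int64)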
All these features are then joined into one DataFrame:
features = feature1.join(feature2)
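In my real code this is a chain of a few dozen joins. An equivalent way to write the combination, sketched here with made-up column names, is a single column-wise concat:

import pandas

# Combine the individual features column-wise on the shared index
# (a sketch; the real code has a few dozen features).
features = pandas.concat(
    [feature1.rename("feature1"),
     feature2.rename("feature2"),
     feature3],
    axis=1,
)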
And I train a random forest classifier:
classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=n_jobs,
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])
The RandomForestClassifier works fine with these features; building a tree takes O(hundreds of megabytes) of memory. However: if, after loading my data, I take a small subset of it:
data_slice = data[data['somecolumn'] > value]
then building a tree for my random forest suddenly requires many gigabytes of memory, even though the size of the features DataFrame is now O(10%) of the original.
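One way to double-check the nominal sizes (assuming a pandas version that supports DataFrame.memory_usage(deep=True)):

# Rough size check of the full frame versus the slice.
print(data.memory_usage(deep=True).sum())        # full frame
print(data_slice.memory_usage(deep=True).sum())  # sliced frame, roughly 10% of the above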
I could believe that this is because a sliced view of the data doesn't permit further slices to be done efficiently (though I don't see how that would propagate into the features array), so I have tried:
data = pandas.DataFrame(data_slice, copy=True)
but it doesn't help.
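Other re-compaction variants I can think of, sketched below (I haven't confirmed that any of them actually changes the memory behaviour):

# Force an explicit, contiguous copy of the slice and drop the old index,
# then rebuild all the features from this copied frame.
data = data_slice.copy()
data = data.reset_index(drop=True)

# Or round-trip each column through a fresh numpy array to guarantee
# newly allocated blocks.
data = pandas.DataFrame({col: data[col].values.copy() for col in data.columns})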
- Why does working with a subset of the data massively increase memory use?
- Is there some way of recompacting / rearranging the DataFrame that might make things more efficient again?