Random Forest Classification - scikit-learn vs. Weka on Prediction with 100 Features

I wanted a much faster random forest classifier than the one in Weka. I first tried the C++ Shark implementation (results: a slight speed improvement, but a drop in correctly classified instances), and then tested Python scikit-learn. I have read on many sites and in papers that Weka performs poorly compared to scikit-learn, WiseRF ...

After my first attempt with a forest of 100 trees:

Training time: Weka  ~ 170s VS Scikit ~ 31s
Prediction results on the same test set: Weka ~ 90% correctly classified VS Scikit score ~ 45% !!!

=> Scikit RF is fast, but it classifies this first attempt very poorly.

I tuned the scikit-learn RandomForestClassifier parameters and managed to reach a score close to 70%, but the training time then rose almost to Weka's (bootstrap = False, min_samples_leaf = 3, min_samples_split = 1, criterion = 'entropy', max_features = 40, max_depth = 6). I have a lot of missing values, and scikit-learn cannot handle them out of the box, so I tried many different strategies (all the Imputer strategies, skipping instances with missing values, replacing them with 0 or with an extreme value) and reached 75%.
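The imputation strategies described above can be compared side by side. The following is only a minimal sketch: it assumes purely numeric features, uses a synthetic dataset in place of the real one, and uses SimpleImputer from sklearn.impute (the current replacement for the older Imputer class). The forest size, the 10% missing rate and the 3-fold split are illustrative choices, not values from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data (assumption: numeric features only).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=0
)

# Blank out roughly 10% of the entries to mimic missing values.
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan

# Candidate strategies: column mean, column median, or a constant 0.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "zero": SimpleImputer(strategy="constant", fill_value=0),
}

scores = {}
for name, imputer in imputers.items():
    # The pipeline refits the imputer inside each fold, so no statistics
    # leak from the validation part into the training part.
    model = make_pipeline(
        imputer, RandomForestClassifier(n_estimators=50, random_state=0)
    )
    scores[name] = cross_val_score(model, X_missing, y, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: mean CV accuracy {score:.3f}")
```

Putting the imputer inside the pipeline keeps the comparison fair across strategies, because each one is evaluated on exactly the same folds.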

So, at this stage, the scikit-learn RandomForestClassifier scores 75% (compared with 90% with Weka) and builds a model in 78s (using 6 cores, versus 170s on one core with Weka). I am very surprised by these results. I also tested ExtraTrees, which is very fast but still averages about 75% correct classification.
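A speed and accuracy comparison like the one above could be set up roughly as follows. This is a sketch on synthetic data, so the timings and scores it prints say nothing about the real dataset; the sample count, the number of trees and the train/test split are assumptions chosen to keep the run short.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 100 features (assumption: no missing values here).
X, y = make_classification(
    n_samples=2000, n_features=100, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for name, cls in (
    ("RandomForest", RandomForestClassifier),
    ("ExtraTrees", ExtraTreesClassifier),
):
    # n_jobs=-1 spreads tree building over all available cores.
    model = cls(n_estimators=100, n_jobs=-1, random_state=0)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    results[name] = (fit_seconds, model.score(X_test, y_test))

for name, (fit_seconds, accuracy) in results.items():
    print(f"{name}: fit {fit_seconds:.2f}s, test accuracy {accuracy:.3f}")
```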

Do you have any ideas what I am missing?

My data: ~100 features, ~100,000 instances, missing values, a classification task (price forecast).

+4
3 answers

Wrapping up the discussion from the comments so that StackOverflow marks this question as answered:

[...] OP [...] GridSearchCV. [...]

+3

[...] Weka [...] Scikit-learn Random Forest (?). [...] Weka [...] Scikit-learn [...]. Weka: [...]. [...] random_state = 1 ([...] Weka), shuffle = True [...] Scikit-learn [...], bootstrap = True [...] Weka. [...]

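from sklearn import ensemble
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# min_samples_split=1 is rejected by current scikit-learn; 2 is the smallest valid integer.
classifier = ensemble.RandomForestClassifier(n_estimators=300, max_depth=30, min_samples_leaf=1, min_samples_split=2, random_state=1, bootstrap=True, criterion='entropy', n_jobs=-1)

cv = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=1)  # num_folds and param_grid are defined elsewhere
grid_search = GridSearchCV(classifier, param_grid=param_grid, cv=cv)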
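The snippet above leaves num_folds and param_grid undefined. A self-contained version might look like the sketch below; the fold count, the grid values, the smaller forest and the synthetic dataset are all hypothetical placeholders chosen so the search finishes quickly, not settings taken from the answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real data.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=1
)

num_folds = 3  # hypothetical value
param_grid = {  # hypothetical grid; widen it for a real search
    "max_depth": [6, None],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 3],
}

classifier = RandomForestClassifier(
    n_estimators=30, criterion="entropy", bootstrap=True, random_state=1
)

# Shuffled, stratified folds with a fixed seed make the search repeatable.
cv = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=1)
grid_search = GridSearchCV(classifier, param_grid=param_grid, cv=cv)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(f"best mean CV accuracy: {grid_search.best_score_:.3f}")
```

After fitting, grid_search.best_estimator_ holds a forest refit on all of X with the best parameter combination.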
0

[...] Python Sklearn [...] Weka. [...] AUC 0.95 [...] 0.7 [...] Weka. [...] Python [...] Weka? [...]

0