I wanted a much faster random forest classifier than the one in Weka. I first tried the C++ Shark implementation (results: slight speed improvement, drop in correctly classified instances), and then tested Python's scikit-learn. I have read on many sites and in papers that Weka performs poorly compared to scikit-learn, WiseRF ...
Results of my first attempt with a forest of 100 trees:
Training time: Weka ~ 170 s vs. Scikit ~ 31 s
Prediction results on the same test set: Weka ~ 90% correctly classified vs. Scikit score ~ 45% !!!
=> Scikit RF is fast, but in this first attempt it classifies very poorly.
I tuned the parameters of scikit's RandomForestClassifier and managed to get a score close to 70%, but the training time then rose almost to Weka's level (bootstrap = False, min_samples_leaf = 3, min_samples_split = 1, criterion = 'entropy', max_features = 40, max_depth = 6). I have a lot of missing values, and scikit cannot handle them out of the box, so I tried many different strategies (all the Imputer strategies, skipping instances with missing values, replacing them with 0 or an extreme value) and reached 75%.
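To make the setup above concrete, here is a minimal sketch of imputation followed by a random forest with the parameters listed, not the exact code that was run. It assumes a recent scikit-learn, where the old `Imputer` class has been replaced by `sklearn.impute.SimpleImputer` and where `min_samples_split = 1` is rejected, so 2 is used instead. The synthetic data and the choice of the median strategy are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the real data: 2,000 instances, 100 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 100))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.rand(*X.shape) < 0.1] = np.nan  # blank out roughly 10% of the values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the imputer on the training data only,
# then reuses the learned medians on the test data.
model = make_pipeline(
    SimpleImputer(strategy="median"),  # assumed strategy, one of several tried
    RandomForestClassifier(
        n_estimators=100,
        criterion="entropy",
        bootstrap=False,
        min_samples_leaf=3,
        min_samples_split=2,  # assumption: 1 is not accepted by recent versions
        max_features=40,
        max_depth=6,
        n_jobs=-1,
        random_state=0,
    ),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.3f}")
```

Swapping `strategy="median"` for `"mean"`, `"most_frequent"`, or `"constant"` (with `fill_value=0`) covers the other imputation variants mentioned.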
So, at this stage, scikit's RandomForestClassifier reaches 75% (compared with 90% with Weka) and builds a model in 78 s (using 6 cores, versus 170 s on one core with Weka). I am very surprised by these results. I also tested ExtraTrees, which is very fast, but it still achieves only about 75% correct classification.
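A side-by-side comparison like the one above can be sketched as follows. This is only an illustration under assumptions: the data is synthetic, median imputation is an arbitrary choice, and both classifiers use scikit-learn defaults apart from 100 trees and `n_jobs=6`.

```python
import time

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data, with ~10% missing values.
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 100))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.rand(*X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the imputer on the training split only, then apply it to both splits.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

results = {}
for cls in (RandomForestClassifier, ExtraTreesClassifier):
    clf = cls(n_estimators=100, n_jobs=6, random_state=0)
    start = time.perf_counter()
    clf.fit(X_train_imp, y_train)
    elapsed = time.perf_counter() - start
    # Store training time in seconds and accuracy on the held-out split.
    results[cls.__name__] = (elapsed, clf.score(X_test_imp, y_test))

for name, (elapsed, acc) in results.items():
    print(f"{name}: fit in {elapsed:.2f} s, accuracy {acc:.3f}")
```

Timing only the `fit` call keeps the comparison focused on model building, which is what the 78 s and 170 s figures refer to.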
Do you have any idea what I am missing?
My data: ~ 100 features, ~ 100,000 instances, missing values, classification task (price forecasting).