Interpreting scikit-learn model results: Extra Trees Classifier scores that differ from each other

I have a dataset on which I am building a predictive model with the ExtraTreesClassifier class. As the code below shows, the initial cross-validation scores (et_score) looked pretty disappointing; I then scored the fitted model on my held-out data, which looked better; then I made a validation-curve plot, and things don't look so hot again. All in all, pretty confusing. Source:

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.cross_validation import cross_val_score

    # split the dataset into train and test sets (roughly 75% / 25%)
    combnum['is_train'] = np.random.uniform(0, 1, len(combnum)) <= .75
    train, test = combnum[combnum['is_train']==True], combnum[combnum['is_train']==False]

    et = ExtraTreesClassifier(n_estimators=200, max_depth=None,
                              min_samples_split=10, random_state=0)

    # combnum, label_columns and columns are defined earlier (not shown)
    labels = train[list(label_columns)].values
    tlabels = test[list(label_columns)].values
    features = train[list(columns)].values
    tfeatures = test[list(columns)].values

    et_score = cross_val_score(et, features, labels.ravel(), n_jobs=-1)
    print("{0} -> ET: {1})".format(label_columns, et_score))

Gives me:

 ['Campaign_Response'] -> ET: [ 0.58746427 0.31725003 0.43522521]) 

Not so hot! Then, scoring against my held-out test data:

    et.fit(features, labels.ravel())
    et.score(tfeatures, tlabels.ravel())
    Out[16]: 0.7434136771300448

Not so bad. Then scoring on the training data itself:

    et.score(features, labels.ravel())
    Out[17]: 0.85246473144769563

Again not bad, but nothing like the earlier cross-validation scores. Then I do:

    from sklearn.learning_curve import validation_curve

    def plot_validation_curve(estimator, X, y, param_name, param_range,
                              ylim=(0, 1.1), cv=5, n_jobs=-1, scoring=None):
        estimator_name = type(estimator).__name__
        plt.title("Validation curves for %s on %s" % (param_name, estimator_name))
        plt.ylim(*ylim); plt.grid()
        plt.xlim(min(param_range), max(param_range))
        plt.xlabel(param_name)
        plt.ylabel("Score")

        train_scores, test_scores = validation_curve(
            estimator, X, y, param_name, param_range,
            cv=cv, n_jobs=n_jobs, scoring=scoring)
        train_scores_mean = np.mean(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)

        plt.semilogx(param_range, train_scores_mean, 'o-', color="r",
                     label="Training score")
        plt.semilogx(param_range, test_scores_mean, 'o-', color="g",
                     label="Cross-validation score")
        plt.legend(loc="best")
        print("Best test score: {:.4f}".format(test_scores_mean[-1]))

and then:

    clf = ExtraTreesClassifier(max_depth=8)
    param_name = 'max_depth'
    param_range = [1, 2, 4, 8, 16, 32]
    plot_validation_curve(clf, features, labels.ravel(),
                          param_name, param_range, scoring='roc_auc')

gives me a plot and a printed score that do not seem to reflect the previous results:

 Best test score: 0.3592 

[validation curve plot for max_depth]

and finally sklearn's metrics give me:

    Accuracy: 0.737

    Classification report
                 precision    recall  f1-score   support

              0       0.76      0.79      0.78      8311
              1       0.70      0.66      0.68      6134

    avg / total       0.74      0.74      0.74     14445
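
A report like the one above is typically produced with sklearn.metrics; a minimal sketch, assuming the fitted et and the held-out arrays tfeatures/tlabels from the earlier snippets (the exact code used is not shown in the question):

    from sklearn import metrics

    # Predict on the held-out features and compare against the held-out labels
    predictions = et.predict(tfeatures)
    print("Accuracy: {:.3f}".format(metrics.accuracy_score(tlabels.ravel(), predictions)))
    print(metrics.classification_report(tlabels.ravel(), predictions))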

I am struggling to make sense of all this. Can anyone help?

1 answer

What you are seeing here is that different cross-validation strategies and classifier parameters lead to different scores.

In your first experiment, you compare the results of cross_val_score with your own 75% / 25% random split. cross_val_score uses StratifiedKFold with K=3 by default to build its folds. StratifiedKFold more or less preserves the order of the data, while your random split destroys any natural ordering by sampling at random. This can explain the difference in performance, especially when your data has some dependence on its natural order. For example, if your data is ordered by timestamp, the characteristics of the data may change over time; that leads to lower performance when the training and test sets come from different time periods, which is what happens with the StratifiedKFold folds.
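
One way to check this is to score once with folds built on the data in its given order and once with shuffled folds, and compare. A minimal sketch, assuming the et, features and labels objects from the question and the old sklearn.cross_validation API used there:

    from sklearn.cross_validation import StratifiedKFold, cross_val_score

    y = labels.ravel()
    # Folds built on the data in its natural order (roughly what cross_val_score does by default)
    ordered_cv = StratifiedKFold(y, n_folds=3, shuffle=False)
    # Folds built after shuffling, which breaks any time/order dependence
    shuffled_cv = StratifiedKFold(y, n_folds=3, shuffle=True, random_state=0)

    print("ordered folds: ", cross_val_score(et, features, y, cv=ordered_cv, n_jobs=-1))
    print("shuffled folds:", cross_val_score(et, features, y, cv=shuffled_cv, n_jobs=-1))

If the two sets of scores differ a lot, the data probably does have an order effect.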

In the second experiment you use the default parameters for the classifier and cross-validation with 5 folds, which again leads to different results. For example, by default ExtraTreesClassifier uses 10 estimators, whereas in your first experiment you used 200, and here you vary the max_depth parameter. For the interpretation: max_depth controls the complexity of the trees, and with only 10 trees a large number of leaves leads to overfitting, which is exactly what you see in the validation curve. The best test score on the curve is about 0.6, not the value that gets printed: your function reports the score at the last point of param_range, whereas you should take the maximum over the whole range.
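
A small sketch of that fix, assuming the test_scores_mean, param_name and param_range values computed inside plot_validation_curve from the question:

    # Inside plot_validation_curve, replace the final print with something like:
    best_idx = np.argmax(test_scores_mean)
    print("Best test score: {:.4f} at {} = {}".format(
        test_scores_mean[best_idx], param_name, param_range[best_idx]))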

I hope this helps with interpreting the scores and understanding the differences. As next steps, I would check the ordering of the data and, if it is temporal, examine it with some visualization. If you expect the same kind of drift in the data you eventually want to predict on, you should not use random sampling; if you are confident that your training set covers all the variation, you can shuffle the data before cross-validating, or set the shuffle parameter of StratifiedKFold to True. For the classifier, I would start with a plain RandomForestClassifier with n_estimators set to 100 before looking at ExtraTrees.
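
A minimal sketch of that baseline, assuming the same features and labels arrays as in the question:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cross_validation import cross_val_score

    # 100 trees instead of the default 10; score with 5-fold cross-validation
    rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    print(cross_val_score(rf, features, labels.ravel(), cv=5, scoring='roc_auc'))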
