I am developing a predictive model on a dataset using scikit-learn's ExtraTreesClassifier. As the code below shows, the initial cross-validated et_score looked pretty disappointing; when I then fit the model and scored it on a held-out split, it looked better; and when I plotted a validation curve, things looked bad again. All in all, pretty confusing. Source:
import numpy as np

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score

# split the dataset into train and test
combnum['is_train'] = np.random.uniform(0, 1, len(combnum)) <= .75
train, test = combnum[combnum['is_train'] == True], combnum[combnum['is_train'] == False]

et = ExtraTreesClassifier(n_estimators=200, max_depth=None,
                          min_samples_split=10, random_state=0)

labels = train[list(label_columns)].values
tlabels = test[list(label_columns)].values
features = train[list(columns)].values
tfeatures = test[list(columns)].values

et_score = cross_val_score(et, features, labels.ravel(), n_jobs=-1)
print("{0} -> ET: {1})".format(label_columns, et_score))
Gives me:
['Campaign_Response'] -> ET: [ 0.58746427 0.31725003 0.43522521])
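Those three fold scores not only look low, they also vary a lot from fold to fold, which makes me wonder whether the default, unshuffled folds interact badly with the row order in combnum. A minimal sketch of what I could try (reusing the old sklearn.cross_validation API from above; I have not run this on my data):

# Sketch: same cross_val_score call, but with explicitly shuffled, stratified folds.
# With a modern scikit-learn this would be sklearn.model_selection.StratifiedKFold.
from sklearn.cross_validation import StratifiedKFold

cv = StratifiedKFold(labels.ravel(), n_folds=3, shuffle=True, random_state=0)
et_score_shuffled = cross_val_score(et, features, labels.ravel(), cv=cv, n_jobs=-1)
print(et_score_shuffled)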
Either way, not so hot! Then, fitting the model and scoring it on my held-out test data:
et.fit(features, labels.ravel())
et.score(tfeatures, tlabels.ravel())

Out[16]: 0.7434136771300448
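For reference, the is_train split above is just a uniform random mask over the rows. Below is a sketch of the equivalent 75/25 split done with train_test_split, stratified on the label (assuming combnum, columns and label_columns are exactly the objects used above; I have not actually swapped this in):

# Sketch: equivalent 75/25 split via train_test_split, stratified on the label.
# stratify= needs sklearn >= 0.17; in newer versions import from sklearn.model_selection.
from sklearn.cross_validation import train_test_split

X = combnum[list(columns)].values
y = combnum[list(label_columns)].values.ravel()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
et.fit(X_tr, y_tr)
print(et.score(X_te, y_te))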
So 0.74 on the held-out data is not so bad. Then, scoring on the training data itself:
et.score(features, labels.ravel())

Out[17]: 0.85246473144769563
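Of course, the 0.85 here is measured on the very rows the forest was trained on, so it should be optimistic. One extra estimate I could look at is an out-of-bag score, sketched below; note it needs bootstrap=True, which is not the ExtraTrees default used above, so it is not exactly the same model:

# Sketch: out-of-bag estimate as a less optimistic alternative to the training score.
# bootstrap=True deviates from the default ExtraTrees configuration used above.
et_oob = ExtraTreesClassifier(n_estimators=200, max_depth=None,
                              min_samples_split=10, bootstrap=True,
                              oob_score=True, random_state=0)
et_oob.fit(features, labels.ravel())
print(et_oob.oob_score_)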
Again, the training score is not bad, but why does it bear no relation to the earlier cross-validation scores? Then I do:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import validation_curve

def plot_validation_curve(estimator, X, y, param_name, param_range,
                          ylim=(0, 1.1), cv=5, n_jobs=-1, scoring=None):
    estimator_name = type(estimator).__name__
    plt.title("Validation curves for %s on %s" % (param_name, estimator_name))
    plt.ylim(*ylim)
    plt.grid()
    plt.xlim(min(param_range), max(param_range))
    plt.xlabel(param_name)
    plt.ylabel("Score")

    train_scores, test_scores = validation_curve(
        estimator, X, y, param_name, param_range,
        cv=cv, n_jobs=n_jobs, scoring=scoring)
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)

    plt.semilogx(param_range, train_scores_mean, 'o-', color="r",
                 label="Training score")
    plt.semilogx(param_range, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    plt.legend(loc="best")
    print("Best test score: {:.4f}".format(test_scores_mean[-1]))
and then:
clf = ExtraTreesClassifier(max_depth=8)
param_name = 'max_depth'
param_range = [1, 2, 4, 8, 16, 32]

plot_validation_curve(clf, features, labels.ravel(),
                      param_name, param_range, scoring='roc_auc')
which gives me a graph and a legend that don't seem to reflect the previous results:
Best test score: 0.3592
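A cross-validated AUC of about 0.36 is below chance for a binary problem, which seems hard to reconcile with the hold-out accuracy above. A quick sanity check I could run is ROC AUC computed directly on my held-out split (a sketch, assuming class 1 is the positive class and et is still the model fitted earlier):

# Sketch: ROC AUC on the held-out split, for comparison with the validation curve.
# Assumes et is the model fitted earlier and class 1 is the positive class.
from sklearn.metrics import roc_auc_score

proba = et.predict_proba(tfeatures)[:, 1]
print(roc_auc_score(tlabels.ravel(), proba))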

and finally sklearn.metrics gives me:
Accuracy: 0.737

Classification report:

             precision    recall  f1-score   support

          0       0.76      0.79      0.78      8311
          1       0.70      0.66      0.68      6134

avg / total       0.74      0.74      0.74     14445
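For context, those last numbers come from the standard sklearn.metrics calls on the held-out split, roughly along these lines (a sketch, not necessarily verbatim what I ran):

# Sketch: how the accuracy and classification report above are typically produced.
from sklearn.metrics import accuracy_score, classification_report

pred = et.predict(tfeatures)
print("Accuracy: {:.3f}".format(accuracy_score(tlabels.ravel(), pred)))
print(classification_report(tlabels.ravel(), pred))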
It seems to me that I ought to be able to reconcile all of these numbers. Can anyone help me understand what is going on?