Python sklearn: what is the difference between accuracy_score and the learning_curve score?

I am using Python sklearn (version 0.17) to select the best model for a dataset. To do this, I followed these steps:

  • Split the dataset with cross_validation.train_test_split, using test_size=0.2.
  • Use GridSearchCV to select the best k-nearest neighbors classifier on the training set.
  • Pass the classifier returned by GridSearchCV to plot_learning_curve. plot_learning_curve gives the graph shown below.
  • Run the classifier returned by GridSearchCV on the held-out test set.

From the plot it can be seen that the score at the maximum number of training samples is about 0.43. This score is the one returned by sklearn.learning_curve.learning_curve.

But when I run the best classifier on the test set, I get an accuracy score of 0.61, returned by sklearn.metrics.accuracy_score (correctly predicted labels / number of labels).

Image Link: graph plot for KNN classifier

This is the code I'm using. I did not include the plot_learning_curve function, as it would take up a lot of space. I took plot_learning_curve from here.

 import pandas as pd
 import numpy as np
 from sklearn.neighbors import KNeighborsClassifier
 from sklearn.metrics import accuracy_score
 from sklearn.metrics import classification_report
 from matplotlib import pyplot as plt
 import sys
 from sklearn import cross_validation
 from sklearn.learning_curve import learning_curve
 from sklearn.grid_search import GridSearchCV
 from sklearn.cross_validation import train_test_split

 filename = sys.argv[1]
 data = np.loadtxt(fname=filename, delimiter=',')
 X = data[:, 0:-1]
 y = data[:, -1]  # last column is the label column

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

 params = {'n_neighbors': [2, 3, 5, 7, 10, 20, 30, 40, 50],
           'weights': ['uniform', 'distance']}
 clf = GridSearchCV(KNeighborsClassifier(), param_grid=params)
 clf.fit(X_train, y_train)

 y_true, y_pred = y_test, clf.predict(X_test)
 acc = accuracy_score(y_pred, y_test)
 print 'accuracy on test set =', acc
 print clf.best_params_

 for params, mean_score, scores in clf.grid_scores_:
     print "%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params)

 y_true, y_pred = y_test, clf.predict(X_test)
 #pred = clf.predict(np.array(features_test))
 acc = accuracy_score(y_pred, y_test)
 print classification_report(y_true, y_pred)
 print 'accuracy last =', acc
 print

 plot_learning_curve(clf, "KNeighborsClassifier", X, y, train_sizes=np.linspace(.05, 1.0, 5))

Is this normal? I can understand that there may be some difference between the scores, but this is a difference of 0.18, which, expressed as percentages, is 43% versus 61%. classification_report also gives an average score of 0.61.

Am I doing something wrong? Is there some difference in how learning_curve calculates its scores? I also tried passing scoring='accuracy' to learning_curve to make sure it matches the accuracy metric, but that made no difference.
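
To see the numbers behind the plot, here is a minimal sketch (it assumes the X, y and clf defined in the code above, and the sklearn 0.17 learning_curve module): it calls learning_curve directly with scoring='accuracy' and prints the cross-validated test scores that plot_learning_curve averages into the curve.

 # Minimal sketch (assumes X, y, clf and np from the code above):
 # call learning_curve directly and print the cross-validated test scores
 # that plot_learning_curve averages into the curve.
 from sklearn.learning_curve import learning_curve  # already imported above

 train_sizes_abs, train_scores, test_scores = learning_curve(
     clf, X, y, train_sizes=np.linspace(.05, 1.0, 5), scoring='accuracy')
 print 'train sizes         =', train_sizes_abs
 print 'mean CV test scores =', test_scores.mean(axis=1)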

Any advice would be very helpful.

I am using the wine quality (white) dataset from UCI, and I removed the header row before running the code.

Tags: python, scikit-learn
1 answer

When you call the learning_curve function, it performs a cross-validation over all of your data. Since you leave the cv parameter empty, it defaults to a 3-fold cross-validation splitting strategy. And here comes the tricky part, because, as stated in the documentation, "if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used; in all other cases, KFold is used". Either way, the folds are built without shuffling the data.
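
If you want to check which splitter your call actually falls back to, here is a small sketch; it uses check_cv, the helper scikit-learn 0.17 uses internally to resolve cv=None (treat the import path as an assumption if you are on another version):

 # Sketch: inspect the splitter that cv=None resolves to for this estimator and y.
 from sklearn.cross_validation import check_cv

 cv_default = check_cv(None, X, y, classifier=True)
 print cv_default   # a KFold / StratifiedKFold object built with shuffle=False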

So what's the difference between KFold and StratifiedKFold?

KFold = divides the dataset into k consecutive folds (no shuffling by default)

StratifiedKFold = "the folds are made by preserving the percentage of samples for each class" (also without shuffling by default)

Take a simple example (reproduced as a runnable sketch after the list):

  • your data labels: [4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0]
  • a non-shuffled 3-fold split produces the subsets [4.0, 4.0, 4.0], [5.0, 5.0, 5.0] and [6.0, 6.0, 6.0]
  • each fold is then used once for validation while the remaining k - 1 = 2 folds form the training set. So, for example, the model would be trained on [5.0, 5.0, 5.0, 6.0, 6.0, 6.0] and validated on [4.0, 4.0, 4.0]
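
Here is the sketch mentioned above. It reproduces the toy split with the sklearn 0.17 cross_validation API (y_toy is just the illustrative label list from the example) and contrasts KFold with StratifiedKFold on the same labels:

 # Sketch: non-shuffled KFold vs StratifiedKFold on the toy labels.
 import numpy as np
 from sklearn.cross_validation import KFold, StratifiedKFold

 y_toy = np.array([4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0])

 for train_idx, val_idx in KFold(len(y_toy), n_folds=3, shuffle=False):
     print 'KFold           validation labels:', y_toy[val_idx]

 for train_idx, val_idx in StratifiedKFold(y_toy, n_folds=3, shuffle=False):
     print 'StratifiedKFold validation labels:', y_toy[val_idx]

The KFold loop prints exactly the three one-class folds from the example, while StratifiedKFold puts one sample of each class into every fold; neither shuffles the order in which it reads the data.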

This explains the low accuracy you see on the learning curve (~0.43). Of course, this is an extreme example to illustrate the situation, but your data is structured in some way and you need to shuffle it.

But when you get ~61% accuracy, you have split the data with the train_test_split method, which shuffles the data by default (and, because the split is random, the class proportions end up roughly preserved as well).

To check this, I performed a simple test to support my hypothesis:

 X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0., random_state=2) 

In your example, you passed all of your X, y data to learning_curve. I am doing a little trick here: splitting with test_size=0. means that all of the data ends up in the train variables. That way I still keep all of the data, but it gets shuffled when train_test_split executes.

Then I called your plotting function, but with the shuffled data:

 plot_learning_curve(clf, "KNeighborsClassifier", X_train2, y_train2, train_sizes=np.linspace(.05, 1.0, 5))

Now the score at the maximum number of training samples is 0.59 instead of 0.43, which is much more consistent with your GridSearch and test-set accuracy.
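
An alternative to the test_size=0. trick is to keep your original call and pass an explicitly shuffled CV object instead. This is a sketch that assumes your plot_learning_curve helper forwards a cv argument to learning_curve, as the version in the scikit-learn docs does:

 # Sketch: shuffle inside the cross-validation itself instead of pre-shuffling the data.
 from sklearn.cross_validation import KFold

 shuffled_cv = KFold(len(y), n_folds=3, shuffle=True, random_state=2)
 plot_learning_curve(clf, "KNeighborsClassifier", X, y, cv=shuffled_cv,
                     train_sizes=np.linspace(.05, 1.0, 5))

sklearn.utils.shuffle(X, y, random_state=2) is another way to shuffle all of the data up front without going through train_test_split.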

Observation: I think the whole point of plotting a learning curve is to see whether adding more samples to the training set lets our estimator perform better or not (so you can decide, for example, when there is no point in adding more examples). As train_sizes you pass np.linspace(.05, 1.0, 5) --> [0.05, 0.2875, 0.525, 0.7625, 1.], so I'm not quite sure what use you intend to get out of this kind of test.
