When you call the learning_curve function, it runs a cross-validation over your whole data. Since you leave the cv parameter empty, this defaults to a 3-fold cross-validation splitting strategy. And here comes the tricky part, because, as stated in the documentation, "If the estimator is a classifier or if y is neither binary nor multiclass, KFold is used." And your estimator is a classifier.
So what's the difference between KFold and StratifiedKFold?
KFold = Split the dataset into k consecutive folds (no shuffling by default)
StratifiedKFold = "The folds are made by preserving the percentage of samples for each class."
Take a simple example:
- your data labels: [4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0]
- your data, not shuffled, split into 3 folds: [4.0, 4.0, 4.0], [5.0, 5.0, 5.0], [6.0, 6.0, 6.0]
- each fold is then used once for validation while the remaining k - 1 (3 - 1 = 2) folds form the training set. So, for example, the model would train on [5.0, 5.0, 5.0, 6.0, 6.0, 6.0] and validate on [4.0, 4.0, 4.0] (the short sketch below reproduces these splits)
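Here is a minimal sketch of those splits (my own illustration; X is a dummy array I made up, only the toy labels above come from the example):

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    y = np.array([4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0])
    X = np.arange(len(y)).reshape(-1, 1)  # dummy features, just something to split

    # KFold: consecutive folds, so each validation fold ends up holding a single class
    for _, val_idx in KFold(n_splits=3).split(X, y):
        print("KFold          validation labels:", y[val_idx])

    # StratifiedKFold: each fold preserves the class proportions instead
    for _, val_idx in StratifiedKFold(n_splits=3).split(X, y):
        print("StratifiedKFold validation labels:", y[val_idx])

With KFold, the model never sees the validation fold's class during training, which is exactly the situation described above.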
This explains the low accuracy you see in the learning curve plot (~0.43, i.e. about 43%). Of course, this is an extreme example to illustrate the situation, but your data is ordered in some way, and you need to shuffle it first.
But in the case where you get ~61% accuracy, you split the data with the train_test_split method, which by default shuffles the data before splitting.
Just look at this: I ran a simple test to support my hypothesis:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0., random_state=2)
In your example, you passed learning_curve all of your X, y data. I am doing a little trick here: splitting with test_size=0., which means all of the data ends up in the train variables. That way I still keep all the data, but it gets shuffled as it passes through train_test_split.
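As a side note, newer scikit-learn versions may reject test_size=0. in train_test_split. If that happens, an equivalent way to get the same "shuffled but complete" data is sklearn.utils.shuffle (my own variant, not part of your original code):

    from sklearn.utils import shuffle

    # Shuffle X and y together while keeping every sample;
    # random_state=2 only makes the shuffle reproducible
    X_train2, y_train2 = shuffle(X, y, random_state=2)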
Then I called your plotting function, but with the shuffled data:
plot_learning_curve(clf, "KNeighborsClassifier", X_train2, y_train2, train_sizes=np.linspace(.05, 1.0, 5))
Now the result with the maximum number of training samples is 0.59 instead of 0.43, which is much closer to your GridSearch result (~61%).
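Another option along the same lines (a variant, not what I did above) is to leave the data as it is and hand a shuffling splitter to the cv parameter instead. This assumes your plot_learning_curve accepts and forwards cv to learning_curve, as the scikit-learn docs example it appears to be based on does:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # An explicitly shuffled 3-fold stratified splitter instead of the unshuffled default
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)
    plot_learning_curve(clf, "KNeighborsClassifier", X, y,
                        cv=cv, train_sizes=np.linspace(.05, 1.0, 5))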
One observation: I think the whole point of plotting a learning curve is to see whether adding more samples to the training set makes your estimator perform better or not (so you can decide, for example, when there is no need to add more examples). As train_sizes you just pass np.linspace(.05, 1.0, 5) --> [0.05, 0.2875, 0.525, 0.7625, 1.], so only those fractions of the training set are evaluated. I am not sure this is the usage you intend for this kind of test.
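If you want to check that directly, you can call learning_curve yourself and compare the validation score at each training size (a minimal sketch, assuming clf and the shuffled X_train2, y_train2 from above):

    import numpy as np
    from sklearn.model_selection import learning_curve

    train_sizes_abs, train_scores, test_scores = learning_curve(
        clf, X_train2, y_train2, train_sizes=np.linspace(.05, 1.0, 5))

    # One row per training size: if the mean validation score stops improving,
    # adding more samples is unlikely to help this estimator much
    for n, score in zip(train_sizes_abs, test_scores.mean(axis=1)):
        print(n, "samples ->", round(score, 3))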