Scikit: calculate precision and recall using the cross_val_score function

I use scikit-learn to perform logistic regression on spam/ham data. X_train is my training data and y_train the labels ("spam" or "ham"), and I trained my logistic regression as follows:

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

If I want to get the accuracy of a 10-fold cross-validation, I just write:

    accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

I thought it was possible to compute precision and recall as well by simply adding one parameter, this way:

    precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
    recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')

But this leads to a ValueError:

    ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')

Is it related to the data (should the labels be binarized?), or should I change the call to the cross_val_score function?

Thank you in advance!

python scikit-learn precision machine-learning logistic-regression
4 answers

To calculate recall and precision, the data does indeed have to be binarized, as follows:

    from sklearn import preprocessing
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_train)
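
The snippet above only fits the binarizer; a minimal sketch of the full workflow (an assumption, reusing the question's X_train/y_train, with the modern sklearn.model_selection import path) could be:

    from sklearn import preprocessing
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Turn the string labels ("ham"/"spam") into 0/1; for binary labels
    # fit_transform returns a column of shape (n, 1), so ravel it to (n,)
    lb = preprocessing.LabelBinarizer()
    y_train_bin = lb.fit_transform(y_train).ravel()

    classifier = LogisticRegression()
    precision = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring='precision')
    recall = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring='recall')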

To go further: I was surprised that I did not need to binarize the data when I wanted to compute the accuracy:

    accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

This is simply because the accuracy formula does not need to know which class is considered positive or negative: (TP + TN) / (TP + TN + FN + FP). Swapping the positive and negative classes only exchanges TP with TN, so the accuracy is unchanged; this is not the case for recall, precision and f1.
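
A quick worked check of that symmetry (toy confusion-matrix counts, not from the question's data): swapping which class counts as "positive" exchanges TP with TN and FP with FN, which leaves accuracy untouched but changes precision.

    # Toy confusion-matrix counts
    TP, TN, FP, FN = 40, 50, 5, 5

    accuracy = (TP + TN) / (TP + TN + FN + FP)          # 0.90
    accuracy_swapped = (TN + TP) / (TN + TP + FP + FN)  # 0.90 -- unchanged

    precision = TP / (TP + FP)          # 40/45 = 0.889
    precision_swapped = TN / (TN + FN)  # 50/55 = 0.909 -- different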


The syntax shown above is correct. It looks like a problem with the data you are using. The labels do not need to be binarized, as long as they are not continuous numbers.

You can demonstrate the same syntax with a different dataset:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    iris = load_iris()
    X_train = iris['data']
    y_train = iris['target']
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision'))
    print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall'))
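
One caveat, about current scikit-learn versions rather than the one this answer was written against: on a multiclass target like iris, plain scoring='precision' now raises an error, and you have to pick an averaging variant:

    # Averaged variants work for multiclass targets in recent scikit-learn
    print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision_macro'))
    print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall_macro'))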

I ran into the same problem here, and I solved it with:

    # precision, recall and F1
    import numpy as np
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.model_selection import cross_val_score

    # Binarize the labels, flattening the (n, 1) output to a 1-D array
    lb = LabelBinarizer()
    y_train = np.array([number[0] for number in lb.fit_transform(y_train)])

    recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
    print('Recall', np.mean(recall), recall)
    precision = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
    print('Precision', np.mean(precision), precision)
    f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
    print('F1', np.mean(f1), f1)
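
An alternative sketch that leaves the string labels untouched: build a scorer that names the positive class explicitly via make_scorer (assuming here that 'spam' is the positive class):

    from sklearn.metrics import make_scorer, precision_score, recall_score
    from sklearn.model_selection import cross_val_score

    # pos_label tells the metric which string label counts as positive
    precision_spam = make_scorer(precision_score, pos_label='spam')
    recall_spam = make_scorer(recall_score, pos_label='spam')

    precision = cross_val_score(classifier, X_train, y_train, cv=5, scoring=precision_spam)
    recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring=recall_spam)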

You can use cross-validation like this to get the f1-score and recall:

    from time import time
    from sklearn.model_selection import cross_val_score

    print('10-fold cross validation:\n')
    start_time = time()
    scores = cross_val_score(clf, X, y, cv=10, scoring='f1')
    recall_scores = cross_val_score(clf, X, y, cv=10, scoring='recall')
    print("f1: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'DecisionTreeClassifier'))
    print("---Classifier %s use %s seconds ---" % ('DecisionTreeClassifier', (time() - start_time)))

For more scoring options, just browse the scoring parameter documentation page.
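
If you would rather enumerate the valid scoring strings than browse the docs, recent scikit-learn (1.0+) exposes them directly; in older versions the sklearn.metrics.SCORERS dict served the same purpose:

    from sklearn.metrics import get_scorer_names  # scikit-learn >= 1.0

    print(sorted(get_scorer_names()))  # includes 'accuracy', 'f1', 'precision', 'recall', ...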
