Python sklearn Multilabel Classification: UserWarning: Label not 226 is present in all training examples

I am working on a multilabel classification problem. My data looks like this:

    DocID  Content            Tags
    1      some text here...  [70]
    2      some text here...  [59]
    3      some text here...  [183]
    4      some text here...  [173]
    5      some text here...  [71]
    6      some text here...  [98]
    7      some text here...  [211]
    8      some text here...  [188]
    .      .............      .....
    .      .............      .....
    .      .............      .....

Here is my code:

    # Imports were not shown in the original post; these are the modules the code uses.
    # (On scikit-learn versions before 0.18, train_test_split lives in sklearn.cross_validation.)
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression

    traindf = pd.read_csv("mul.csv")
    print "This is what our training data looks like:"
    print traindf

    t = TfidfVectorizer()
    X = traindf["Content"]
    y = traindf["Tags"]

    print "Original Content"
    print X
    X = t.fit_transform(X)
    print "Content After transformation"
    print X

    print "Original Tags"
    print y
    y = MultiLabelBinarizer().fit_transform(y)
    print "Tags After transformation"
    print y

    print "Features extracted:"
    print t.get_feature_names()
    print "Scores of features extracted"
    idf = t.idf_
    print dict(zip(t.get_feature_names(), idf))

    print "Splitting into training and validation sets..."
    Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)
    print "Training Set Content and Tags"
    print Xtrain
    print ytrain
    print "Validation Set Content and Tags"
    print Xvalidate
    print yvalidate

    print "Creating classifier"
    clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))
    clf.fit(Xtrain, ytrain)
    predictions = clf.predict(Xvalidate)
    print "Predicted Tags are:"
    print predictions
    print "Correct Tags on Validation Set are :"
    print yvalidate
    print "Accuracy on validation set: %.3f" % clf.score(Xvalidate, yvalidate)

The code runs fine, but I keep getting these warnings:

    X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 288 is present in all training examples.
      str(classes[c]))
    X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 304 is present in all training examples.
      str(classes[c]))
    X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 340 is present in all training examples.

What does this warning mean? Does it show that my data is not diverse enough?

python scikit-learn machine-learning logistic-regression multilabel-classification
1 answer

Some data mining algorithms have problems when certain items are present in all or many records. This is a problem, for example, when mining association rules with the Apriori algorithm.
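To illustrate, here is a minimal sketch in plain Python (with hypothetical transactions, not the asker's data): an item that occurs in every record has support 1.0, so any rule predicting it holds with confidence 1.0 while telling you nothing.

    # Hypothetical transactions; "milk" appears in every one of them.
    transactions = [
        {"milk", "bread"},
        {"milk", "eggs"},
        {"milk", "bread", "eggs"},
        {"milk", "butter"},
    ]

    def support(itemset):
        # Fraction of transactions that contain every item in `itemset`.
        return sum(itemset <= t for t in transactions) / float(len(transactions))

    print support({"milk"})   # 1.0 -- the item occurs everywhere
    print support({"bread"})  # 0.5
    # Confidence of {bread} -> {milk} = support(both) / support(antecedent)
    print support({"bread", "milk"}) / support({"bread"})  # 1.0, yet uninformative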

Whether or not this is a problem depends on the classifier. I do not know the specific classifier you are using, but here is an example where it can make a difference: fitting a decision tree with a maximum depth.

Say you fit a decision tree with a maximum depth using Hunt's algorithm and the Gini index to determine the best split (see here for an explanation, slide 35 onwards). A first split could be on whether or not a record has label 288. If every record has this label, the Gini index will be optimal for such a split. That means the first so many splits are useless, because you are not actually splitting the training set (you split into an empty set, without 288, and the whole set itself, with 288). So the first so many levels of the tree are useless. If you then set a maximum depth, this can result in a tree with low accuracy.
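To make this concrete, here is a minimal sketch in plain Python (again with made-up records): a split on an always-present label leaves one branch empty, so the weighted Gini impurity after the split equals the impurity before it.

    def gini(labels):
        # Gini impurity of a list of class labels.
        n = float(len(labels))
        if n == 0:
            return 0.0
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    # Hypothetical records: (has_label_288, target_class); label 288 is everywhere.
    records = [(True, "a"), (True, "b"), (True, "a"), (True, "b")]

    parent = [cls for _, cls in records]
    with_288 = [cls for has, cls in records if has]
    without_288 = [cls for has, cls in records if not has]

    n = float(len(records))
    weighted = (len(with_288) / n) * gini(with_288) \
             + (len(without_288) / n) * gini(without_288)

    print "Gini before split: %.3f" % gini(parent)       # 0.500
    print "Gini after split:  %.3f" % weighted           # 0.500 -- no improvement
    print "Records without 288: %d" % len(without_288)   # 0 -- one branch is empty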

In any case, the warning you are getting is not a problem with your code; at most it points at your data set. You should check whether the classifier you are using is sensitive to this kind of thing – if it is, it may give better results if you filter out the labels that occur everywhere.
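As a minimal numpy sketch of that filtering (assuming y is the dense indicator matrix returned by MultiLabelBinarizer, as in the question's code): a column of all ones means the label occurs everywhere, a column of all zeros means it never occurs, and either way the corresponding one-vs-rest estimator only ever sees a single class, so constant columns can be dropped before fitting.

    import numpy as np

    # Toy indicator matrix standing in for the question's `y`;
    # column 0 is constant (that label occurs in every row).
    y = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [1, 0, 1]])

    keep = ~np.all(y == y[0, :], axis=0)  # True for columns containing both 0s and 1s
    y_filtered = y[:, keep]

    print "Kept %d of %d label columns" % (keep.sum(), y.shape[1])  # Kept 2 of 3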
