I am working on a multi-label classification problem. My data looks like this:
    DocID   Content              Tags
    1       some text here...    [70]
    2       some text here...    [59]
    3       some text here...    [183]
    4       some text here...    [173]
    5       some text here...    [71]
    6       some text here...    [98]
    7       some text here...    [211]
    8       some text here...    [188]
    .       .............        .....
    .       .............        .....
    .       .............        .....
Here is my code:
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    # On older scikit-learn releases, train_test_split lives in sklearn.cross_validation
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    traindf = pd.read_csv("mul.csv")
    print "This is what our training data looks like:"
    print traindf

    t = TfidfVectorizer()
    X = traindf["Content"]
    y = traindf["Tags"]

    print "Original Content"
    print X
    # Turn the raw text into a sparse TF-IDF matrix
    X = t.fit_transform(X)
    print "Content After transformation"
    print X

    print "Original Tags"
    print y
    # Turn the tags into a binary indicator matrix, one column per label
    y = MultiLabelBinarizer().fit_transform(y)
    print "Tags After transformation"
    print y

    print "Features extracted:"
    print t.get_feature_names()
    print "Scores of features extracted"
    idf = t.idf_
    print dict(zip(t.get_feature_names(), idf))

    print "Splitting into training and validation sets..."
    Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)
    print "Training Set Content and Tags"
    print Xtrain
    print ytrain
    print "Validation Set Content and Tags"
    print Xvalidate
    print yvalidate

    print "Creating classifier"
    # One binary logistic regression per label column
    clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))
    clf.fit(Xtrain, ytrain)
    predictions = clf.predict(Xvalidate)

    print "Predicted Tags are:"
    print predictions
    print "Correct Tags on Validation Set are :"
    print yvalidate
    print "Accuracy on validation set: %.3f" % clf.score(Xvalidate, yvalidate)
The code runs fine, but I keep getting these warnings:
    X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 288 is present in all training examples.
      str(classes[c]))
    X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 304 is present in all training examples.
      str(classes[c]))
    X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 340 is present in all training examples.
What do they mean? Do they indicate that my data is not diverse enough?
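One way to investigate, as far as I can tell: `OneVsRestClassifier` appears to emit this warning when a label's column in the training target holds only one value, so "Label not N" would mean column N of `ytrain` is all zeros (no training example carries that label after the split). The sketch below is a minimal, plain-Python check for such columns. The helper name `constant_label_columns` and the demo matrix are hypothetical, made up for illustration; it assumes the target is a dense 0/1 indicator matrix like the one `MultiLabelBinarizer` returns by default.

```python
def constant_label_columns(y_indicator):
    """Return {column_index: value} for every label column that holds
    a single value across all rows of a 0/1 indicator matrix.

    Hypothetical helper: accepts a list of lists or a dense 2-D array.
    """
    rows = [list(row) for row in y_indicator]
    if not rows:
        return {}
    constant = {}
    for col in range(len(rows[0])):
        # Collect the distinct values seen in this label column
        values = {row[col] for row in rows}
        if len(values) == 1:
            constant[col] = values.pop()
    return constant


# Made-up training target: 3 examples, 3 labels.
# Label 1 never occurs, so its column is all zeros.
y_train_demo = [
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 1],
]

print(constant_label_columns(y_train_demo))  # {1: 0}
```

Calling `constant_label_columns(ytrain)` after the split would list the label columns that never (value 0) or always (value 1) occur in the training half. With `test_size=.5` and tags that each appear on only a few documents, it seems plausible that some tags end up entirely in the validation half.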
python scikit-learn machine-learning logistic-regression multilabel-classification
Abtpst