Use Scikit-Learn to classify into multiple categories

I am trying to use one of the training methods with scikit-learn to classify text fragments into one or more categories. The prediction function of all the algorithms I tried just returns a single match.

For example, I have a piece of text:

"Theaters in New York compared to those in London" 

And I trained the algorithm to select a place for each piece of text that I feed.

In the above example, I would like it to return New York and London, but it only returns New York.

Can scikit-learn be used to get multiple results? Or even return the label with the next highest probability?

Thanks for your help.

--- Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is an example of the code I'm using:

    y_train = ('New York', 'London')
    train_set = ("new york nyc big apple", "london uk great britain")
    vocab = {'new york': 0, 'nyc': 1, 'big apple': 2, 'london': 3, 'uk': 4, 'great britain': 5}
    count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2), vocabulary=vocab)
    test_set = ('nice day in nyc',
                'london town',
                'hello welcome to the big apple. enjoy it here and london too')
    X_vectorized = count.transform(train_set).todense()
    smatrix2 = count.transform(test_set).todense()
    base_clf = MultinomialNB(alpha=1)
    clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
    Y_pred = clf.predict(smatrix2)
    print Y_pred

Result: ['New York' 'London' 'London']

+68
python scikit-learn classification
May 10 '12 at 1:59 a.m.
6 answers

What you want is called multi-label classification. Scikit-learn can do this. See here: http://scikit-learn.org/dev/modules/multiclass.html

I'm not sure what's going wrong in your example; my version of sklearn does not seem to have WordNGramAnalyzer. Perhaps it's a matter of using more training examples, or of trying a different classifier? Note, though, that the multi-label classifier expects the target to be a sequence of tuples/lists of labels.

The following works for me:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train = [[0], [0], [0], [0], [0], [0], [1], [1], [1], [1], [1], [1], [0, 1], [0, 1]]
    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    classifier = Pipeline([
        ('vectorizer', CountVectorizer(min_n=1, max_n=2)),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])
    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)
    for item, labels in zip(X_test, predicted):
        print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

For me, this produces the output:

    nice day in nyc => New York
    welcome to london => London
    hello welcome to new york. enjoy it here and london too => New York, London

Hope this helps.
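As for the second part of the question (returning the label with the next highest probability): LinearSVC doesn't expose predict_proba, but its decision_function produces one margin per label, and sorting those margins ranks the labels. A minimal sketch of that approach — the tiny corpus and variable names below are illustrative, not taken from the answer above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus; columns of y_train are [New York, London].
X_train = ["new york is a hell of a town",
           "nyc is nice",
           "london is in the uk",
           "it rains a lot in london"]
y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
target_names = ['New York', 'London']

vectorizer = TfidfVectorizer()
clf = OneVsRestClassifier(LinearSVC())
clf.fit(vectorizer.fit_transform(X_train), y_train)

# decision_function returns one margin per label; argsort in
# descending order gives the labels from most to least likely.
scores = clf.decision_function(vectorizer.transform(["nice day in nyc"]))
ranked = [target_names[i] for i in np.argsort(scores[0])[::-1]]
print(ranked)
```

Even when predict returns only one label, the second entry of `ranked` is the runner-up the question asks about.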

+97
May 10 '12 at 5:23

EDIT: Updated for Python 3, scikit-learn 0.18.1 using MultiLabelBinarizer, as suggested.

I also worked on this and made a small improvement to mwv's excellent answer that may be useful: it accepts text labels as input rather than binary labels, and encodes them using MultiLabelBinarizer.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train_text = [["new york"], ["new york"], ["new york"], ["new york"], ["new york"],
                    ["new york"], ["london"], ["london"], ["london"], ["london"],
                    ["london"], ["london"], ["new york", "london"], ["new york", "london"]]
    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'london is rainy',
                       'it is raining in britian',
                       'it is raining in britian and the big apple',
                       'it is raining in britian and nyc',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y_train_text)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])
    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)
    all_labels = mlb.inverse_transform(predicted)

    for item, labels in zip(X_test, all_labels):
        print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following result:

    nice day in nyc => new york
    welcome to london => london
    london is rainy => london
    it is raining in britian => london
    it is raining in britian and the big apple => new york
    it is raining in britian and nyc => london, new york
    hello welcome to new york. enjoy it here and london too => london, new york
+50
04 Oct '13 at 2:10

I came across this too, and the problem for me was that my y_train was a sequence of strings, not a sequence of sequences of strings. Apparently, OneVsRestClassifier decides based on the input label format whether to use multi-class or multi-label. So change:

 y_train = ('New York','London') 

to

 y_train = (['New York'],['London']) 

Apparently, this sequence-of-sequences label format will go away in a future version: https://github.com/scikit-learn/scikit-learn/pull/1987
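To illustrate the difference, here is a small sketch (the labels are hypothetical) showing how the sequence-of-sequences format is encoded explicitly with MultiLabelBinarizer, which is the supported route in current scikit-learn:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample gets a *list* of labels, even when there is only one.
y_train = [['New York'], ['London'], ['New York', 'London']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train)
print(mlb.classes_)  # ['London' 'New York'] -- label order is alphabetical
print(Y)
# [[0 1]
#  [1 0]
#  [1 1]]
```

The binary indicator matrix `Y` is what the classifier's fit method should receive, and `mlb.inverse_transform` maps predictions back to label tuples.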

+5
Sep 27 '13 at 15:49

Change this line to make it work in newer versions of scikit-learn:

    # lb = preprocessing.LabelBinarizer()
    lb = preprocessing.MultiLabelBinarizer()
+5
Nov 06 '16 at 22:03

A few multi-class classification examples are given below:

Example 1:

    import numpy as np
    from sklearn.preprocessing import LabelBinarizer

    encoder = LabelBinarizer()
    arr2d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1])
    transformed_label = encoder.fit_transform(arr2d)
    print(transformed_label)

Output:

    [[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
     [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

    import numpy as np
    from sklearn.preprocessing import LabelBinarizer

    encoder = LabelBinarizer()
    arr2d = np.array(['Leopard', 'Lion', 'Tiger', 'Lion'])
    transformed_label = encoder.fit_transform(arr2d)
    print(transformed_label)

Output:

    [[1 0 0]
     [0 1 0]
     [0 0 1]
     [0 1 0]]
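Note that LabelBinarizer one-hot encodes exactly one label per sample. For the multi-label setting discussed in this thread, MultiLabelBinarizer is the counterpart; a small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

encoder = MultiLabelBinarizer()
# Each sample may carry several labels at once.
labels = [['Lion'], ['Tiger', 'Lion'], ['Leopard']]
transformed_label = encoder.fit_transform(labels)
print(encoder.classes_)   # ['Leopard' 'Lion' 'Tiger']
print(transformed_label)
# [[0 1 0]
#  [0 1 1]
#  [1 0 0]]
```

Unlike the one-hot rows above, a row here can contain more than one 1, which is exactly what a multi-label target needs.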
0
Feb 12 '18 at 11:40

How do I find the accuracy on the test data in this program? I'm having trouble computing it. I want to calculate the precision, recall, and F1 score.
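A sketch of one way to do this, assuming the ground truth and predictions are binary indicator matrices like the ones MultiLabelBinarizer produces (the arrays below are hypothetical):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions in binary indicator form
# (columns could be, e.g., New York and London).
Y_true = np.array([[1, 0], [0, 1], [1, 1]])
Y_pred = np.array([[1, 0], [0, 1], [1, 0]])

# 'micro' pools all label decisions before computing the score;
# 'macro' and 'samples' are other common averaging choices for
# multi-label data.
print(precision_score(Y_true, Y_pred, average='micro'))  # 1.0
print(recall_score(Y_true, Y_pred, average='micro'))     # 0.75
print(f1_score(Y_true, Y_pred, average='micro'))         # ~0.857
```

With the pipeline answers above, `Y_pred` would come from `classifier.predict(X_test)` and `Y_true` from running the test labels through the same MultiLabelBinarizer.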

0
Dec 08 '18 at 11:01


