How to build an SVC classifier for an imbalanced dataset using scikit-learn and matplotlib?

I have a text classification task with 2599 documents and five labels, 1 to 5. The documents are distributed like this:

    label | texts
    ------+------
        5 |  1190
        4 |   839
        3 |   239
        1 |   204
        2 |   127

Everything is set up, but when I classify this text data I get very low performance, and I also receive warnings about ill-defined metrics:

    Accuracy: 0.461057692308
    score: 0.461057692308
    precision: 0.212574195636
    recall: 0.461057692308

    UndefinedMetricWarning: Precision and F-score are ill-defined and being
    set to 0.0 in labels with no predicted samples.
      'precision', 'predicted', average, warn_for)

    confusion matrix:
    [[  0   0   0   0 153]
     [  0   0   0   0  94]
     [  0   0   0   0 194]
     [  0   0   0   0 680]
     [  0   0   0   0 959]]

    classification report:
                 precision    recall  f1-score   support

              1       0.00      0.00      0.00       153
              2       0.00      0.00      0.00        94
              3       0.00      0.00      0.00       194
              4       0.00      0.00      0.00       680
              5       0.46      1.00      0.63       959

    avg / total       0.21      0.46      0.29      2080

It is clear that this happens because my dataset is imbalanced, so I found this article, where the authors propose several ways of dealing with the problem:

The problem is that with imbalanced datasets the learned boundary is too close to the positive instances. We need to bias the SVM in a way that will push the boundary away from the positive instances. Veropoulos et al [14] suggest using different error costs for the positive (C+) and negative (C−) classes.

I know this may be difficult, but SVC does offer a few hyperparameters. So my question is: is there any way to bias SVC so that it pushes the boundary away from the dominant classes, using the hyperparameters that the SVC classifier provides? I know this can be a hard problem, but any help is appreciated. Thanks in advance!
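For reference, scikit-learn exposes exactly this kind of per-class error cost through SVC's class_weight parameter, which scales C per class. A minimal sketch; the weights below merely approximate the inverse class frequencies from the table above (1190/204, 1190/127, ...) and are illustrative, not tuned values:

    from sklearn.svm import SVC

    # class_weight={label: w} sets the error cost of class `label` to w * C,
    # i.e. the C+/C- idea from Veropoulos et al.: rarer classes get a larger
    # effective C, so mistakes on them are penalized more.
    clf = SVC(kernel='linear',
              class_weight={1: 5.8, 2: 9.4, 3: 5.0, 4: 1.4, 5: 1.0})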

    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np
    import pandas as pd
    from sklearn.cross_validation import train_test_split
    from sklearn.decomposition import TruncatedSVD
    from sklearn.svm import SVC

    tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                                 sublinear_tf=False, ngram_range=(2, 2))

    df = pd.read_csv('/path/of/the/file.csv', header=0, sep=',',
                     names=['id', 'text', 'label'])
    reduced_data = tfidf_vect.fit_transform(df['text'].values)
    y = df['label'].values

    svd = TruncatedSVD(n_components=5)
    reduced_data = svd.fit_transform(reduced_data)

    X_train, X_test, y_train, y_test = train_test_split(
        reduced_data, y, test_size=0.33)

    # with no weights:
    clf = SVC(kernel='linear')
    clf.fit(X_train, y_train)   # fit on the training split only
    prediction = clf.predict(X_test)

    # separating hyperplane of the unweighted classifier
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(-5, 5)
    yy = a * xx - clf.intercept_[0] / w[1]

    # get the separating hyperplane using weighted classes
    wclf = SVC(kernel='linear', class_weight={1: 10})
    wclf.fit(X_train, y_train)
    ww = wclf.coef_[0]
    wa = -ww[0] / ww[1]
    wyy = wa * xx - wclf.intercept_[0] / ww[1]

    # plot separating hyperplanes and samples
    import matplotlib.pyplot as plt
    h0 = plt.plot(xx, yy, 'k-', label='no weights')
    h1 = plt.plot(xx, wyy, 'k--', label='with weights')
    plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
    plt.legend()
    plt.axis('tight')
    plt.show()

But I don't get anything useful, and I can't understand what happened. This is the plot:

[plot: weighted vs. normal]

Then:

    # metrics [unweighted]:
    from sklearn.metrics import precision_score, recall_score, \
        confusion_matrix, classification_report, accuracy_score

    print '\nAccuracy:', accuracy_score(y_test, prediction)
    print '\nscore:', clf.score(X_train, y_train)
    print '\nrecall:', recall_score(y_test, prediction)
    print '\nprecision:', precision_score(y_test, prediction)
    print '\nclassification report:\n', classification_report(y_test, prediction)
    print '\nconfusion matrix:\n', confusion_matrix(y_test, prediction)

    # metrics [weighted]:
    print 'weighted:\n'
    wprediction = wclf.predict(X_test)   # use the weighted model's predictions
    print '\nAccuracy:', accuracy_score(y_test, wprediction)
    print '\nscore:', wclf.score(X_train, y_train)
    print '\nrecall:', recall_score(y_test, wprediction)
    print '\nprecision:', precision_score(y_test, wprediction)
    print '\nclassification report:\n', classification_report(y_test, wprediction)
    print '\nconfusion matrix:\n', confusion_matrix(y_test, wprediction)

This is the data that I use. How can I fix this and set the problem up correctly? Thanks in advance!

Based on the answer to this question, I removed the following lines:

    # from sklearn.decomposition import TruncatedSVD
    # svd = TruncatedSVD(n_components=5)
    # reduced_data = svd.fit_transform(reduced_data)

    # w = clf.coef_[0]
    # a = -w[0] / w[1]
    # xx = np.linspace(-10, 10)
    # yy = a * xx - clf.intercept_[0] / w[1]

    # ww = wclf.coef_[0]
    # wa = -ww[0] / ww[1]
    # wyy = wa * xx - wclf.intercept_[0] / ww[1]

    # # plot separating hyperplanes and samples
    # import matplotlib.pyplot as plt
    # h0 = plt.plot(xx, yy, 'k-', label='no weights')
    # h1 = plt.plot(xx, wyy, 'k--', label='with weights')
    # plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
    # plt.legend()
    # plt.axis('tight')
    # plt.show()

These were the results:

    Accuracy: 0.787878787879
    score: 0.779437105112
    recall: 0.787878787879
    precision: 0.827705441238

These metrics have improved. How can I plot this result so I get a nice example, like the ones in the documentation? I would like to see the behavior of the two hyperplanes. Thanks!

+5
5 answers

Reducing your data to 5 dimensions with SVD:

    svd = TruncatedSVD(n_components=5)
    reduced_data = svd.fit_transform(reduced_data)

You are losing a lot of information. Just deleting these lines gives me 78% accuracy.

Leaving the class_weight parameter as you set it seems to do better than removing it. I have not tried giving it other values.

Look into using k-fold cross-validation and grid search to tune your model's parameters. You can also use a Pipeline if you want to reduce the dimensionality of your data, so you can find out how far you can reduce it without hurting performance. Here is an example that shows how to tune the whole pipeline with grid search, sketched below.
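A minimal sketch of such a pipeline search, assuming X_tfidf is the TF-IDF matrix from the question before any SVD (in scikit-learn >= 0.18 GridSearchCV lives in sklearn.model_selection instead):

    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import TruncatedSVD
    from sklearn.svm import SVC
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

    pipeline = Pipeline([
        ('svd', TruncatedSVD()),
        ('clf', SVC(kernel='linear')),
    ])

    param_grid = {
        'svd__n_components': [5, 50, 100, 300],  # how far to reduce, if at all
        'clf__C': [0.1, 1, 10],
    }

    # cross-validated search over the whole pipeline at once;
    # 'f1_weighted' is friendlier than accuracy on imbalanced labels
    grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_weighted')
    grid.fit(X_tfidf, y)
    print(grid.best_params_)
    print(grid.best_score_)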

As for plotting, you can only plot 2d or 3d data. After training with more dimensions, you can reduce your data to 2 or 3 dimensions and then plot it. See here for a plotting example. The code is similar to what you are doing, and I got similar results. The problem is that your data has many features, and you can only plot things on a 2d or 3d surface. That usually makes the plot look weird and hard to interpret.

I suggest you don't bother with the plot, since it is not going to tell you much about high-dimensional data. Use k-fold cross-validation with grid search to get the best parameters, and if you want to take a closer look at overfitting, plot learning curves instead.

All of this together will tell you much more about your model's behavior than plotting a hyperplane will.
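A minimal sketch of those learning curves, reusing the reduced_data and y arrays from the question (the learning_curve helper moved to sklearn.model_selection in later releases):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.learning_curve import learning_curve  # sklearn.model_selection in newer releases
    from sklearn.svm import SVC

    # train/validation scores at increasing training-set sizes;
    # a large gap between the two curves suggests overfitting
    train_sizes, train_scores, valid_scores = learning_curve(
        SVC(kernel='linear'), reduced_data, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='training score')
    plt.plot(train_sizes, valid_scores.mean(axis=1), 'o-', label='cross-validation score')
    plt.xlabel('training examples')
    plt.ylabel('score')
    plt.legend(loc='best')
    plt.show()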

+4

If I understand your numbers correctly, you have:

- 1190 texts labeled 5
- 1409 texts labeled 1-4

You can try a two-stage classification. First, treat all texts labeled 5 as 1 and all the others as 0, and train a classifier for this binary task.

Second, drop all label-5 examples from your dataset and train a classifier to distinguish the labels 1-4.

At prediction time, run the first classifier; if it returns 0, run the second classifier to get the final label.
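A minimal sketch of that cascade, reusing X_train, y_train, X_test from the question (the helper name is mine):

    import numpy as np
    from sklearn.svm import SVC

    # Stage 1: is it label 5 or not?
    clf_5_vs_rest = SVC(kernel='linear')
    clf_5_vs_rest.fit(X_train, (y_train == 5).astype(int))

    # Stage 2: distinguish labels 1-4, trained without the label-5 examples
    rest_mask = y_train != 5
    clf_1_to_4 = SVC(kernel='linear')
    clf_1_to_4.fit(X_train[rest_mask], y_train[rest_mask])

    def predict_cascade(X):
        # run stage 1; wherever it says "not 5", ask stage 2 for the label
        is_five = clf_5_vs_rest.predict(X) == 1
        labels = np.full(X.shape[0], 5, dtype=int)
        if (~is_five).any():
            labels[~is_five] = clf_1_to_4.predict(X[~is_five])
        return labels

    prediction = predict_cascade(X_test)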

That said, I don't think this distribution is really that skewed or imbalanced (it would have to be something like 90% label 5 and 10% everything else to be skewed enough that introducing a bias into SVC becomes interesting). So I think you may want to try a different classification algorithm, because it looks like your current choice is not suitable for this task. Or maybe you need to use another kernel with SVC (I assume you are using a linear kernel; try something else, e.g. RBF or polynomial).
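For the kernel suggestion, a minimal sketch (the gamma and C values are placeholders to tune via cross-validation, not recommendations):

    from sklearn.svm import SVC

    # RBF kernel instead of the linear one; tune gamma and C
    clf = SVC(kernel='rbf', gamma=0.001, C=10)
    clf.fit(X_train, y_train)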

+2

As a simple solution, just replicate the instances of the smaller classes until the number of instances per class is balanced. It works even though it seems silly, and it requires no special classifier configuration.
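A minimal sketch of that replication, assuming a dense feature matrix X_train and labels y_train as in the question:

    import numpy as np

    def oversample(X, y, seed=0):
        """Replicate rows of the smaller classes (sampling with
        replacement) until every class matches the largest one."""
        rng = np.random.RandomState(seed)
        classes, counts = np.unique(y, return_counts=True)
        n_max = counts.max()
        picked = np.concatenate([
            rng.choice(np.where(y == c)[0], size=n_max, replace=True)
            for c in classes
        ])
        return X[picked], y[picked]

    X_bal, y_bal = oversample(X_train, y_train)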

The idea behind this approach is to imitate a learning rate that is scaled for each class relative to its class size. That is, in gradient-based optimization methods you would scale the learning rate inversely proportionally to the class sizes, so that you prevent the model from favoring some classes over others.

If your problem is very large and you are using batch updates, then instead of going over the whole dataset and counting the classes, consider only the mini-batch and adjust the learning rate dynamically with respect to the number of instances of each class in the mini-batch.

This means that if your master learning rate is 0.01, and in a batch of instances 0.4 of them are class A and 0.6 of them are class B, then you set the final learning rate to the master learning rate for class A (that is, leave it as it is) and to 2/3 * the master learning rate for class B. So you take wider steps for class A and, conversely, smaller ones for class B.
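A minimal sketch of that per-batch scaling (the helper is illustrative, meant for a hand-rolled SGD loop, not a scikit-learn API):

    import numpy as np

    def per_class_rates(batch_labels, master_rate=0.01):
        """Learning rate per class, scaled inversely to the class's share
        of the mini-batch; the rarest class keeps the master rate."""
        classes, counts = np.unique(batch_labels, return_counts=True)
        fractions = counts.astype(float) / len(batch_labels)
        scale = fractions.min() / fractions
        return dict(zip(classes, master_rate * scale))

    # a batch that is 40% class A and 60% class B, as in the text:
    batch = np.array(['A'] * 4 + ['B'] * 6)
    print(per_class_rates(batch))  # A keeps 0.01, B gets 2/3 * 0.01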

My preference, especially for larger problems, is to augment the data of the smaller classes by replicating instances or, as a more robust choice, by adding some noise and variance to the replicated instances. That way (depending on your problem) you also train a model that is more robust to small changes (this is very common in image classification, for instance).

+2

You have probably already tried setting class_weight to 'auto', but I would like to make sure.

Perhaps experimenting with balancing (oversampling or undersampling) may help; some libraries for this have already been recommended by klubow.
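For the first point, a minimal sketch ('auto' was renamed 'balanced' in scikit-learn 0.17):

    from sklearn.svm import SVC

    # weights each class inversely proportionally to its frequency
    # in the training data ('balanced' in scikit-learn >= 0.17)
    clf = SVC(kernel='linear', class_weight='auto')
    clf.fit(X_train, y_train)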

+2
