I have a text classification task with 2599 documents and five labels, from 1 to 5. The documents are distributed as follows:

label | texts
------+------
  5   | 1190
  4   |  839
  3   |  239
  1   |  204
  2   |  127
Every classifier I try on this text data gives very low performance, and I also get warnings about ill-defined metrics:
Accuracy: 0.461057692308

score: 0.461057692308

precision: 0.212574195636

recall: 0.461057692308

confusion matrix:
[[  0   0   0   0 153]
 [  0   0   0   0  94]
 [  0   0   0   0 194]
 [  0   0   0   0 680]
 [  0   0   0   0 959]]

classification report:
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       153
          2       0.00      0.00      0.00        94
          3       0.00      0.00      0.00       194
          4       0.00      0.00      0.00       680
          5       0.46      1.00      0.63       959

avg / total       0.21      0.46      0.29      2080
It is clear that this happens because my dataset is imbalanced: the confusion matrix shows that every sample is being predicted as the majority class 5. So I found this paper, where the authors propose several approaches to deal with this problem:
The problem with imbalanced datasets is that the learned boundary is too close to the positive instances. We need to bias SVM in a way that will push the boundary away from the positive instances. Veropoulos et al [14] suggest using different error costs for the positive (C+) and negative (C-) classes.
I know this could be very complicated, but SVC offers several hyperparameters. So my question is: is there any way to bias SVC, through the hyperparameters the classifier offers, so that the decision boundary is pushed away from the minority classes? I know this could be a difficult problem, but any help is appreciated. Thanks in advance, guys.
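For context, scikit-learn does expose this knob: SVC's class_weight parameter rescales the penalty C per class, which is exactly the "different error costs" idea from the quote. A minimal sketch of that (the specific weight values below are invented for illustration, not taken from the post):

from sklearn.svm import SVC

# class_weight sets the parameter C of class i to class_weight[i] * C,
# so minority classes can be given a larger effective error cost.
# These weight values are made up, purely for illustration:
clf = SVC(kernel='linear', class_weight={1: 6.0, 2: 9.0, 3: 5.0, 4: 1.5, 5: 1.0})

# Or let scikit-learn pick weights inversely proportional to class
# frequencies ('balanced' in recent versions, 'auto' in older ones):
clf = SVC(kernel='linear', class_weight='balanced')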
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cross_validation import train_test_split
import pandas as pd

df = pd.read_csv('/path/of/the/file.csv', header=0, sep=',',
                 names=['id', 'text', 'label'])

# Bigram TF-IDF features
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2, 2))
reduced_data = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

# Project the TF-IDF matrix down to 5 dimensions
svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

X_train, X_test, y_train, y_test = train_test_split(reduced_data, y,
                                                    test_size=0.33)
But I don't get anything meaningful from it, and I can't understand what happened. This is the plot:

[plot image]
Then:
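(Note: the snippet below references clf and prediction, which the code above does not define; presumably a classifier was fit on the training split first. A hypothetical sketch of that step, with the choice of SVC and its settings being a guess rather than the original code:)

from sklearn.svm import SVC

# Hypothetical fitting step; the original definition of clf was not shown
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)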
# Show some metrics (for multiclass data the averaged scores use
# average='weighted'; made explicit here):
from sklearn.metrics import precision_score, recall_score, \
    confusion_matrix, classification_report, accuracy_score

print '\nAccuracy:', accuracy_score(y_test, prediction)
print '\nscore:', clf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, prediction, average='weighted')
print '\nprecision:', precision_score(y_test, prediction, average='weighted')
print '\nclassification report:\n', classification_report(y_test, prediction)
print '\nconfusion matrix:\n', confusion_matrix(y_test, prediction)
This is the data I am using. How can I fix this and approach the problem correctly? Thanks in advance, guys!
Following the answer to this question, I removed the following lines:
#svd = TruncatedSVD(n_components=5)
#reduced_data = svd.fit_transform(reduced_data)
The metrics have improved (which makes sense: compressing a bigram TF-IDF matrix into just 5 SVD components throws away most of the signal). Now, how can I plot this result to get a nice example, like the ones in the documentation? I would like to see the behavior of the separating hyperplanes. Thanks, guys!
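A minimal sketch of a documentation-style decision-surface plot, assuming the data is projected to 2 SVD components purely for visualisation (the classifier settings here are placeholders, not taken from the post):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD

# Project the TF-IDF matrix to 2 dimensions, for plotting only
X_2d = TruncatedSVD(n_components=2).fit_transform(reduced_data)

# Placeholder classifier settings ('auto' instead of 'balanced' on old versions)
clf_2d = SVC(kernel='linear', class_weight='balanced')
clf_2d.fit(X_2d, y)

# Evaluate the classifier on a grid covering the data, then draw
# the predicted regions, as in the scikit-learn documentation examples
x_min, x_max = X_2d[:, 0].min() - .1, X_2d[:, 0].max() + .1
y_min, y_max = X_2d[:, 1].min() - .1, X_2d[:, 1].max() + .1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
Z = clf_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=15, edgecolors='k')
plt.title('SVC decision regions on 2 SVD components')
plt.show()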