Scikit-Learn: Label not x is present in all training examples

I am trying to do multi-label classification using SVM. I have almost 8k features and a y vector of length almost 400 per instance. My y vectors are already binarized, so I did not use MultiLabelBinarizer(), but when I apply it to my original y data it gives the same result anyway.

I run this code:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X = np.genfromtxt('data_X', delimiter=";")
    Y = np.genfromtxt('data_y', delimiter=";")

    # first 2600 rows for training, one row held out as a test sample
    training_X = X[:2600, :]
    training_y = Y[:2600, :]
    test_sample = X[2600:2601, :]
    test_result = Y[2600:2601, :]

    classif = OneVsRestClassifier(SVC(kernel='rbf'))
    classif.fit(training_X, training_y)
    print(classif.predict(test_sample))
    print(test_result)

After the whole fitting process, when it comes to the prediction part, it says: Label not x is present in all training examples (x is a few different numbers within the range of my y vector's length, which is 400). After that, it prints the predicted y vector, which is always all zeros, with length 400 (the length of the y vector). I am new to scikit-learn as well as to machine learning, and I could not understand the problem here. What is the problem, and what should I do to fix it? Thanks.

python scikit-learn machine-learning
1 answer

There are two problems here:

1) A warning about a missing label
2) You get all 0s for your predictions

The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, some of them must be very rare, and with any split of the data some classes may end up missing from one side of the split. There may also be classes that simply never appear in your data at all. You can try Y.sum(axis=0).all(), and if it's False, then some classes do not appear even once in Y. All this sounds bad, but realistically you cannot correctly predict classes that occur 0, 1, or any other very small number of times, so predicting 0 for them is probably about the best you can do.
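As a rough diagnostic (a minimal sketch, assuming Y is the binarized label matrix from your question and training_y is your training slice), you can check which labels never occur at all and which are only missing from the training split:

    import numpy as np

    Y = np.genfromtxt('data_y', delimiter=";")
    training_y = Y[:2600, :]

    never_in_data = np.where(Y.sum(axis=0) == 0)[0]                 # labels with no positive example anywhere
    missing_from_train = np.where(training_y.sum(axis=0) == 0)[0]   # labels absent from the training split

    print("labels never positive in Y:", never_in_data)
    print("labels missing from the training data:", missing_from_train)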

As for the all-0 predictions, I will point out that with 400 classes, probably all of your classes occur much less than half the time. You can check Y.mean(axis=0).max() to get the maximum label frequency. With 400 classes, it may be only a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably choose 0 for every class on every instance. This is not really an error; it is just a consequence of all the class frequencies being low.
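For example (a short sketch on the same Y matrix):

    import numpy as np

    Y = np.genfromtxt('data_y', delimiter=";")
    freq = Y.mean(axis=0)   # fraction of instances that carry each label

    print("highest label frequency:", freq.max())
    print("labels occurring in under 1% of instances:", (freq < 0.01).sum())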

If you know that each instance has at least one positive label, you can get the decision values (clf.decision_function) and select the class with the highest value for each instance. You will have to write a bit of code for that, though; a sketch follows below.
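A minimal sketch of that idea, reusing the fitted classif and test_sample from your code (it assumes every instance really does have at least one positive label):

    import numpy as np

    # one real-valued decision value per (instance, class)
    scores = classif.decision_function(test_sample)   # shape (n_samples, n_classes)

    predictions = np.zeros_like(scores, dtype=int)
    # mark the top-scoring class as 1 for every instance
    predictions[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1
    print(predictions)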

I once had a top-10 finish in a Kaggle contest that looked a lot like this. It was a multi-label problem with ~200 classes, none of which occurred with even 10% frequency, and we needed 0-1 predictions. In that case I took the decision values and kept the highest-scoring class, plus anything that was above a threshold, choosing the threshold that worked best on a holdout set. The code from that entry is on GitHub: Kaggle Greek Media code. You might want to take a look at it; a rough sketch of the idea is below.
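The gist of that approach as a sketch (the 0.0 threshold here is just a placeholder; in practice you would pick the value that scores best on held-out data):

    import numpy as np

    threshold = 0.0   # placeholder; tune on a holdout set

    scores = classif.decision_function(test_sample)
    predictions = (scores > threshold).astype(int)

    # still guarantee at least one positive label per instance: keep the top-scoring class
    predictions[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1
    print(predictions)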

If you made it this far, thanks for reading. I hope this helps.

