Calculating the nearest match to a mean/stddev pair with LibSVM

I am new to SVMs, and I am trying to use the Python interface to libsvm to classify a sample containing a mean and a stddev. However, I am getting nonsensical results.

Is this task inappropriate for SVMs, or is there an error in my use of libsvm? Below is the simple Python script I am using for testing:

    #!/usr/bin/env python
    # Simple classifier test.
    # Adapted from the svm_test.py file included in the standard libsvm distribution.
    from collections import defaultdict
    from svm import *

    # Define our sparse data formatted training and testing sets.
    labels = [1,2,3,4]
    train = [ # key: 0=mean, 1=stddev
        {0:2.5,1:3.5},
        {0:5,1:1.2},
        {0:7,1:3.3},
        {0:10.3,1:0.3},
    ]
    problem = svm_problem(labels, train)
    test = [
        ({0:3, 1:3.11},1),
        ({0:7.3,1:3.1},3),
        ({0:7,1:3.3},3),
        ({0:9.8,1:0.5},4),
    ]

    # Test classifiers.
    kernels = [LINEAR, POLY, RBF]
    kname = ['linear','polynomial','rbf']

    correct = defaultdict(int)
    for kn,kt in zip(kname,kernels):
        print kt
        param = svm_parameter(kernel_type=kt, C=10, probability=1)
        model = svm_model(problem, param)
        for test_sample,correct_label in test:
            pred_label, pred_probability = model.predict_probability(test_sample)
            correct[kn] += pred_label == correct_label

    # Show results.
    print '-'*80
    print 'Accuracy:'
    for kn,correct_count in correct.iteritems():
        print '\t', kn, '%.6f (%i of %i)' % (correct_count/float(len(test)), correct_count, len(test))

The domain seems pretty simple. I would expect that if the classifier is trained to know that a mean of 2.5 means label 1, then when it sees a mean of 2.4 it should return label 1 as the most likely classification. However, every kernel has an accuracy of 0%. Why is this?

On a side note: is there a way to hide all the verbose training output that libsvm dumps to the terminal? I have searched libsvm's docs and code, but I cannot find a way to disable it.

In addition, I had wanted to use plain strings as the keys in my sparse dataset (e.g. {'mean': 2.5, 'stddev': 3.5}). Unfortunately, libsvm only supports integers. I tried using a long integer representation of the string (e.g. "mean" == 1109110110971110), but libsvm seems to truncate these to normal 32-bit integers. The only workaround I see is to maintain a separate "key" file that maps each string to an integer ('mean'=0, 'stddev'=1). But obviously that will be a pain, since I will have to maintain and serialize a second file along with the serialized classifier. Does anyone see an easier way?
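To make that workaround concrete, here is a minimal sketch of the key-mapping idea (the FEATURE_INDEX mapping, the encode helper, and the JSON file name are just illustrative, not part of libsvm):

    import json

    # Hypothetical explicit name -> libsvm feature index mapping.
    FEATURE_INDEX = {'mean': 0, 'stddev': 1}

    def encode(sample):
        # Convert {'mean': 2.5, 'stddev': 3.5} into libsvm's {0: 2.5, 1: 3.5}.
        return dict((FEATURE_INDEX[name], value) for name, value in sample.items())

    # Persist the mapping next to the serialized classifier so the two stay in sync.
    with open('feature_index.json', 'w') as fh:
        json.dump(FEATURE_INDEX, fh)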

+6
python artificial-intelligence machine-learning svm libsvm
2 answers

The problem seems to be related to combining multiclass prediction with probability estimates.

If you tweak your code to not make probability estimates, it actually does work, e.g.:

    <snip>
    # Test classifiers.
    kernels = [LINEAR, POLY, RBF]
    kname = ['linear','polynomial','rbf']

    correct = defaultdict(int)
    for kn,kt in zip(kname,kernels):
        print kt
        param = svm_parameter(kernel_type=kt, C=10) # Here -> rm probability = 1
        model = svm_model(problem, param)
        for test_sample,correct_label in test:
            # Here -> change predict_probability to just predict
            pred_label = model.predict(test_sample)
            correct[kn] += pred_label == correct_label
    </snip>

With this change, I get:

    --------------------------------------------------------------------------------
    Accuracy:
            polynomial 1.000000 (4 of 4)
            rbf 1.000000 (4 of 4)
            linear 1.000000 (4 of 4)

Prediction with probability estimates does work if you double up the data in the training set (i.e. include each data point twice). However, I could not find any way to parametrize the model so that multiclass prediction with probabilities would work with just the original four training points.
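For illustration, a sketch of that doubling workaround, using the same old-style libsvm bindings and toy data as the question:

    # Duplicate every training point so probability estimates can be fitted.
    labels2 = labels * 2
    train2 = train * 2
    problem2 = svm_problem(labels2, train2)

    param = svm_parameter(kernel_type=RBF, C=10, probability=1)
    model = svm_model(problem2, param)

    # Probability-based prediction now behaves as expected on the toy data.
    pred_label, pred_probability = model.predict_probability({0: 3, 1: 3.11})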

+5

If you are interested in a different way of doing this, you could do the following. This method is theoretically more sound, but not as straightforward.

By mentioning the mean and std, it seems you are referring to data that you assume is distributed in some way, for example that the data you observe is Gaussian distributed. You can then use the symmetrised Kullback-Leibler divergence as a distance measure between those distributions, and use something like k-nearest neighbour for classification.

For two probability densities p and q, you have KL(p, q) = 0 only if p and q are the same. However, KL is not symmetric, so to get a proper distance measure you can use

distance(p1, p2) = KL(p1, p2) + KL(p2, p1)

For Gaussians, KL(p1, p2) = {(μ1 - μ2)² + σ1² - σ2²} / (2σ2²) + ln(σ2/σ1). (I stole this from here, where you can also find the derivation :)

In short:

Given a training set D of (mean, std, class) tuples and a new pair p = (mean, std), find the q in D for which distance(q, p) is minimal and return that class.
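As a rough illustration, here is a minimal self-contained sketch of that nearest-neighbour rule under the symmetrised KL distance (the function names and the 1-NN choice are just one way to set it up):

    from math import log

    def kl_gauss(mu1, sigma1, mu2, sigma2):
        # KL(p1 || p2) for two univariate Gaussians, per the formula above.
        return (((mu1 - mu2) ** 2 + sigma1 ** 2 - sigma2 ** 2) / (2.0 * sigma2 ** 2)
                + log(sigma2 / sigma1))

    def kl_distance(p1, p2):
        # Symmetrised KL: KL(p1, p2) + KL(p2, p1).
        return (kl_gauss(p1[0], p1[1], p2[0], p2[1])
                + kl_gauss(p2[0], p2[1], p1[0], p1[1]))

    def classify(D, p):
        # 1-nearest-neighbour: return the class of the closest training tuple.
        best = min(D, key=lambda q: kl_distance((q[0], q[1]), p))
        return best[2]

    # The four (mean, std, class) tuples from the question's training data.
    D = [(2.5, 3.5, 1), (5.0, 1.2, 2), (7.0, 3.3, 3), (10.3, 0.3, 4)]
    print classify(D, (3.0, 3.11))   # -> 1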

For me, this is nicer than the multi-kernel SVM approach, as the classification method is not as arbitrary.

+3
