Flink SVM 90% incorrect classification

I am trying to do binary classification with the SVM implementation from Flink ML. When I evaluated the classifier, I got about an 85% error rate on the training set. I plotted the 3D data, and it looked like the data could be separated perfectly with a hyperplane.

When I tried to get the weight vector from the SVM, I only found an option to get the weight vector without the intercept of the hyperplane, i.e. a hyperplane that passes through (0,0,0).

I have no clue where the error might be and would appreciate any hints.

val env = ExecutionEnvironment.getExecutionEnvironment

val input: DataSet[(Int, Int, Boolean, Double, Double, Double)] =
  env.readCsvFile(filepathTraining, ignoreFirstLine = true, fieldDelimiter = ";")

// label: true -> 1.0, false -> -1.0; features: the three Double columns
val inputLV = input.map( t =>
  LabeledVector(if (t._3) 1.0 else -1.0, DenseVector(Array(t._4, t._5, t._6))) )

val trainTestDataSet = Splitter.trainTestSplit(inputLV, 0.8, precise = true, seed = 100)
val trainLV = trainTestDataSet.training
val testLV = trainTestDataSet.testing

val svm = SVM()
svm.fit(trainLV)

val testVD = testLV.map(lv => (lv.vector, lv.label))
val evalSet = svm.evaluate(testVD)

// groups the data into false negatives, false positives, true negatives, true positives
evalSet.map(t => (t._1, t._2, 1))
  .groupBy(0, 1)
  .reduce((x1, x2) => (x1._1, x1._2, x1._3 + x2._3))
  .print()

The plotted data looks like this:

Data plot

scala svm apache-flink flinkml
1 answer

The SVM classifier does not give you the distance to the origin (the so-called bias or threshold), because that is a parameter of the predictor. Different threshold values lead to different precision and recall, and the optimal one is specific to the use case. Usually we use a ROC curve (receiver operating characteristic) to find it.

Relevant parameters of the SVM predictor (from the Flink docs; a usage sketch follows the list):

  • ThresholdValue - sets the threshold for testing / predicting. Outputs below this value are classified as negative, outputs above it as positive. The default value is 0.0.
  • OutputDecisionFunction - set this to true to output the distance to the separating hyperplane instead of the binary class label.
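
As a rough sketch, configuring both before fitting could look like this. The setter names setThreshold and setOutputDecisionFunction are my assumption of how these parameters are exposed in the FlinkML Scala API:

// Assumed setters for the ThresholdValue / OutputDecisionFunction parameters above.
val svm = SVM()
  .setThreshold(0.0)               // outputs above 0.0 are classified as positive
  .setOutputDecisionFunction(true) // return the distance to the hyperplane, not +/-1

svm.fit(trainLV)

// With OutputDecisionFunction = true, evaluate yields (trueLabel, distance) pairs.
val distances = svm.evaluate(testLV.map(lv => (lv.vector, lv.label)))

With the raw distances available, you can apply any threshold yourself when computing metrics.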

ROC curve

How to find the optimal threshold is an art in itself. Without knowing anything more about the problem, you can always plot the ROC curve (True Positive Rate against False Positive Rate) for different threshold values and look for the point with the greatest distance from a random guess (the diagonal line, AUC 0.5). Ultimately, though, the choice of threshold also depends on the cost of a false positive and the cost of a false negative in your domain. See, for example, the Wikipedia ROC curve illustration comparing three different classifiers.

To pick an initial threshold, you can average the decision values over the training data (or a sample of it):

// weights is a DataSet of size 1
val weights = svm.weightsOption.get.collect().head

val initialThreshold = trainLV.map { lv =>
  (lv.label - (weights dot lv.vector), 1L)
}.reduce { (avg1, avg2) =>
  (avg1._1 + avg2._1, avg1._2 + avg2._2)
}.collect() match {
  case Seq((sum, len)) => sum / len
}

and then vary it in a loop, measuring TPR and FPR on the test data (a sketch is shown below).
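
A minimal sketch of such a loop, reusing the distances DataSet of (trueLabel, distance) pairs from the earlier sketch (the variable names are just for illustration):

// Collect the scored test set locally (assumed to be small) and sweep thresholds.
val scored: Seq[(Double, Double)] = distances.collect() // (trueLabel, distance)

val thresholds = scored.map(_._2).distinct.sorted

val rocPoints = thresholds.map { threshold =>
  val predicted = scored.map { case (label, distance) =>
    (label, if (distance > threshold) 1.0 else -1.0)
  }
  val tp = predicted.count { case (l, p) => l ==  1.0 && p ==  1.0 }
  val fn = predicted.count { case (l, p) => l ==  1.0 && p == -1.0 }
  val fp = predicted.count { case (l, p) => l == -1.0 && p ==  1.0 }
  val tn = predicted.count { case (l, p) => l == -1.0 && p == -1.0 }
  // assumes both classes are present in the test data
  (threshold, tp.toDouble / (tp + fn), fp.toDouble / (fp + tn))
}

rocPoints.foreach { case (t, tpr, fpr) => println(s"threshold=$t TPR=$tpr FPR=$fpr") }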

Other hyperparameters

Note that the SVM trainer also has parameters (so-called hyperparameters) that need to be tuned for optimal prediction performance. There are many ways to do this, and listing them would make this post too long; I just wanted to draw your attention to it. If you feel lazy, here is a link to Wikipedia: Hyperparameter optimization.
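
For illustration only, here is a sketch of a naive grid search over two of the trainer's parameters. The setter names (setIterations, setRegularization, setStepsize, setSeed) are the ones I recall from the FlinkML docs, and the candidate values are made up:

// Naive grid search; ideally evaluate on a separate validation split,
// not on the final test set.
for {
  regularization <- Seq(0.001, 0.01, 0.1)
  stepsize       <- Seq(0.1, 1.0)
} {
  val candidate = SVM()
    .setIterations(100)
    .setRegularization(regularization)
    .setStepsize(stepsize)
    .setSeed(100)

  candidate.fit(trainLV)

  // count misclassified test examples at the default threshold
  val errors = candidate
    .evaluate(testLV.map(lv => (lv.vector, lv.label)))
    .filter(t => t._1 != t._2)
    .count()

  println(s"reg=$regularization step=$stepsize errors=$errors")
}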

An extra dimension?

There is a (somewhat hacky) way to avoid dealing with the threshold for now. You can squeeze the offset into an extra dimension of the feature vector, like this:

val bias = 10.0 // choose a large value

val inputLV = input.map { t =>
  LabeledVector(
    if (t._3) 1.0 else -1.0,
    DenseVector(Array(t._4, t._5, t._6, bias)))
}

Here is a good discussion of why you should NOT do this. Basically, the problem is that the bias would then take part in regularization. But there are no absolute truths in machine learning.
