I am trying to use the Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to build a model that distinguishes between two highly imbalanced classes.
One set has 150,000,000 negative cases and the other only 50,000 positive cases, an imbalance of roughly 3000:1.
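For reference, the training step, roughly (a simplified sketch; training is an RDD[LabeledPoint] with label 1.0 for positives, and the RF parameter values shown are just the usual ones from the Spark documentation examples, not something I tuned):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.tree.RandomForest

// training: RDD[LabeledPoint], label 1.0 = positive, 0.0 = negative
val lrModel = new LogisticRegressionWithLBFGS().run(training)
val rfModel = RandomForest.trainClassifier(training,
  numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100, featureSubsetStrategy = "auto",
  impurity = "gini", maxDepth = 4, maxBins = 32)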
After training the LR and RF classifiers with default parameters, I get very similar results from both classifiers, for example on the following test set:
Test instances: 26842
Test positives = 433.0
Test negatives = 26409.0

The classifier detects:

truePositives = 0.0
trueNegatives = 26409.0
falsePositives = 433.0
falseNegatives = 0.0
Precision = 0.9838685641904478
Recall = 0.9838685641904478
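Precision and Recall here look like the micro-averaged values from MulticlassMetrics, which both reduce to overall accuracy (26409 / 26842 ≈ 0.9839), hence the identical numbers. A minimal sketch with made-up (prediction, label) pairs that reproduces this output (assuming an active SparkContext sc):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Hypothetical pairs mimicking the run above: everything predicted negative
val predictionAndLabels = sc.parallelize(
  Seq.fill(26409)((0.0, 0.0)) ++ Seq.fill(433)((0.0, 1.0)))
val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.precision)   // 0.9838... (micro-averaged = overall accuracy)
println(metrics.recall)      // identical value, hence Precision == Recall
println(metrics.recall(1.0)) // 0.0 -- the positive class is never detected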
It seems that the classifier cannot detect any positive instance at all. In addition, no matter how the data is split into training and test sets, the classifier reports exactly the same number of false positives, equal to the number of positives the test set actually contains.
The default threshold for the LR classifier is 0.5, and setting the threshold as high as 0.8 makes no difference:
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)
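Clearing the threshold is one way to diagnose this (a sketch continuing the snippet above; test is assumed to be an RDD[LabeledPoint]): predict() then returns the raw class-1 probability, so I can check whether any positive instance ever scores near the cutoff. Lowering the threshold, not raising it, is what would make the model predict positives more often:

model.clearThreshold()  // predict() now returns P(label = 1) instead of 0/1

// Highest raw scores on the test set; if none come close to 0.5,
// no threshold in (0.5, 1.0] can ever produce a positive prediction
val scored = test.map(p => (model.predict(p.features), p.label))
scored.sortBy(_._1, ascending = false).take(5).foreach(println)

model.setThreshold(0.1)  // e.g. a lower cutoff, tuned on held-out data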
Questions:
1) How can I manage the classifier threshold so that the classifier is more sensitive to the class with a tiny number of positive instances relative to the class with a huge number of negative instances?
2) Are there any other MLlib classifiers better suited to this problem?
3) What is the intercept parameter used for in the logistic regression algorithm?
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
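From the docs, my understanding is that setIntercept(true) makes the optimizer also fit a bias term b, so the model becomes P(y = 1 | x) = 1 / (1 + exp(-(w · x + b))) instead of forcing the decision boundary through the origin, but I would like this confirmed. A quick sketch of the difference:

val withBias = new LogisticRegressionWithSGD().setIntercept(true).run(training)
println(withBias.intercept) // the fitted bias term b
println(withBias.weights)   // the fitted weight vector w

val noBias = new LogisticRegressionWithSGD().run(training) // default: no intercept
println(noBias.intercept)   // 0.0 -- no bias term is trained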