I am trying to use the Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to build a model that distinguishes between two highly imbalanced classes.
One set has 150,000,000 negative cases and the other only 50,000 positive cases, an imbalance of roughly 3000:1.
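For reference, the training step, roughly (a simplified sketch; training is an RDD[LabeledPoint] with label 1.0 for positives, and the RF parameter values shown are just the usual ones from the Spark documentation examples, not something I tuned):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.tree.RandomForest

// training: RDD[LabeledPoint], label 1.0 = positive, 0.0 = negative
val lrModel = new LogisticRegressionWithLBFGS().run(training)
val rfModel = RandomForest.trainClassifier(training,
  numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100, featureSubsetStrategy = "auto",
  impurity = "gini", maxDepth = 4, maxBins = 32)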
After training the LR and RF classifiers with default parameters, I get very similar results from both classifiers, for example on the following test set:
Test instances: 26842
Test positives = 433.0
Test negatives = 26409.0

The classifier detects:

truePositives = 0.0
trueNegatives = 26409.0
falsePositives = 433.0
falseNegatives = 0.0
Precision = 0.9838685641904478
Recall = 0.9838685641904478
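Precision and Recall here look like the micro-averaged values from MulticlassMetrics, which both reduce to overall accuracy (26409 / 26842 ≈ 0.9839), hence the identical numbers. A minimal sketch with made-up (prediction, label) pairs that reproduces this output (assuming an active SparkContext sc):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Hypothetical pairs mimicking the run above: everything predicted negative
val predictionAndLabels = sc.parallelize(
  Seq.fill(26409)((0.0, 0.0)) ++ Seq.fill(433)((0.0, 1.0)))
val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.precision)   // 0.9838... (micro-averaged = overall accuracy)
println(metrics.recall)      // identical value, hence Precision == Recall
println(metrics.recall(1.0)) // 0.0 -- the positive class is never detected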
It seems that the classifier cannot detect any positive instance at all. In addition, no matter how the data is split into training and test sets, the classifier reports exactly the same number of false positives, equal to the number of positives the test set actually contains.
The default threshold for the LR classifier is 0.5, and setting the threshold as high as 0.8 makes no difference:
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)
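Clearing the threshold is one way to diagnose this (a sketch continuing the snippet above; test is assumed to be an RDD[LabeledPoint]): predict() then returns the raw class-1 probability, so I can check whether any positive instance ever scores near the cutoff. Lowering the threshold, not raising it, is what would make the model predict positives more often:

model.clearThreshold()  // predict() now returns P(label = 1) instead of 0/1

// Highest raw scores on the test set; if none come close to 0.5,
// no threshold in (0.5, 1.0] can ever produce a positive prediction
val scored = test.map(p => (model.predict(p.features), p.label))
scored.sortBy(_._1, ascending = false).take(5).foreach(println)

model.setThreshold(0.1)  // e.g. a lower cutoff, tuned on held-out data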
Questions:
1) How can I manage the classifier threshold so that the classifier is more sensitive to the class with a tiny number of positive instances relative to the class with a huge number of negative instances?
2) Are there any other MLlib classifiers better suited to this problem?
3) What is the intercept parameter used for in the logistic regression algorithm?
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
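From the docs, my understanding is that setIntercept(true) makes the optimizer also fit a bias term b, so the model becomes P(y = 1 | x) = 1 / (1 + exp(-(w · x + b))) instead of forcing the decision boundary through the origin, but I would like this confirmed. A quick sketch of the difference:

val withBias = new LogisticRegressionWithSGD().setIntercept(true).run(training)
println(withBias.intercept) // the fitted bias term b
println(withBias.weights)   // the fitted weight vector w

val noBias = new LogisticRegressionWithSGD().run(training) // default: no intercept
println(noBias.intercept)   // 0.0 -- no bias term is trained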