Predicting class probabilities with Gradient Boosted Trees in Spark using the tree output

Spark's GBTs currently only give you the predicted labels.

I was wondering whether it is possible to calculate predicted probabilities for a class (say, for all instances falling under a particular leaf).

The code for building the GBT:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
    import org.apache.spark.mllib.util.MLUtils

    // Import the data
    val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") // credit approval data set from the UCI machine learning repository

    // Parse the data
    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    // Split the data
    val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Train a GradientBoostedTrees model.
    // The defaultParams for Classification use LogLoss by default.
    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    boostingStrategy.numIterations = 2 // We can use more iterations in practice.
    boostingStrategy.treeStrategy.numClasses = 2
    boostingStrategy.treeStrategy.maxDepth = 2
    boostingStrategy.treeStrategy.maxBins = 32
    boostingStrategy.treeStrategy.subsamplingRate = 0.5
    boostingStrategy.treeStrategy.maxMemoryInMB = 1024
    boostingStrategy.learningRate = 0.1
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

    val model = GradientBoostedTrees.train(training, boostingStrategy)
    model.toDebugString

This gives me 2 trees of depth 2, as shown below:

    Tree 0:
      If (feature 3 <= 2.0)
        If (feature 2 <= 1.25)
          Predict: -0.5752212389380531
        Else (feature 2 > 1.25)
          Predict: 0.07462686567164178
      Else (feature 3 > 2.0)
        If (feature 0 <= 30.17)
          Predict: 0.7272727272727273
        Else (feature 0 > 30.17)
          Predict: 1.0
    Tree 1:
      If (feature 5 <= 67.0)
        If (feature 4 <= 100.0)
          Predict: 0.5739387416147804
        Else (feature 4 > 100.0)
          Predict: -0.550117566730937
      Else (feature 5 > 67.0)
        If (feature 2 <= 0.0)
          Predict: 3.0383669122382835
        Else (feature 2 > 0.0)
          Predict: 0.4332824083446489

My question is: can I use the trees above to calculate predicted probabilities like:

For each instance in the feature set used for prediction:

exp(leaf score from tree 0 + leaf score from tree 1) / (1 + exp(leaf score from tree 0 + leaf score from tree 1))

This gives me something that looks like a probability, but I am not sure it is the right way to do it. Also, if there is any document explaining how the leaf score (the Predict value) is calculated, I would be very grateful if anyone could share it.
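
For one test instance, this is what I mean (just a sketch, using the model above and a single instance, for example val point = test.first):

    // Sketch only: sum the leaf scores from each tree for one instance,
    // then squash with a logistic function.
    val point = test.first
    val leafScores = model.trees.map(_.predict(point.features)) // leaf score from each tree
    val rawScore = leafScores.sum
    val prob = math.exp(rawScore) / (1 + math.exp(rawScore)) // same as 1 / (1 + exp(-rawScore))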

Any suggestion would be excellent.

5 answers

Here is my approach using Spark's internal dependencies. You will need to import the linear algebra library for the matrix operation later, i.e. multiplying the tree predictions by the learning rate.

    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

Suppose you have trained a model with GBT:

 val model = GradientBoostedTrees.train(trainingData, boostingStrategy) 

To calculate the probability using the model object:

    // Get the log-odds predictions from each tree
    val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }

    // Transform the arrays into matrices for multiplication
    val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
    val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
    val learningRate = model.treeWeights
    val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
    val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)

    // Calculate the probability by ensembling the log odds
    val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
    classProb.collect

    // You may tweak your decision boundary for different class labels
    val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
    classLabel.collect

Here is a snippet of code that you can copy and paste directly into the Spark shell:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

    // Load and parse the data file.
    val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
    val data = csvData.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a GBT model.
    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    boostingStrategy.numIterations = 50
    boostingStrategy.treeStrategy.numClasses = 2
    boostingStrategy.treeStrategy.maxDepth = 6
    boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
    val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

    // Get class labels from the raw predict function
    val predictedLabels = model.predict(testData.map(_.features))
    predictedLabels.collect

    // Get class probabilities
    val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
    val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
    val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
    val learningRate = model.treeWeights
    val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
    val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
    val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
    val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
    classLabel.collect
The same score can also be computed compactly with a BLAS dot product of the per-tree predictions and the tree weights:

    import com.github.fommil.netlib.BLAS.{getInstance => blas}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

    // Weighted sum of the per-tree predictions (dot product with the tree weights)
    def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
      val treePredictions = gbdt.trees.map(_.predict(features))
      blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
    }

    def sigmoid(v: Double): Double = {
      1 / (1 + Math.exp(-v))
    }

    // model is the output of GradientBoostedTrees.train(..., ...)
    // testData is in libSVM format
    val labelAndPreds = testData.map { point =>
      var prediction = score(point.features, model)
      prediction = sigmoid(prediction)
      (point.label, Vectors.dense(1.0 - prediction, prediction))
    }

In fact, I was able to predict probabilities using the trees and the formulation asked in the question. I checked against the predicted GBT labels, and they match exactly when I use a threshold of 0.5.

So, do the same with a minor change.

For each instance in the feature set used for prediction:

exp(leaf score from tree 0 + (learning_rate) * leaf score from tree 1) / (1 + exp(leaf score from tree 0 + (learning_rate) * leaf score from tree 1))

It gives me predicted probabilities.
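
In code, a sketch of this generalized to any number of trees (assuming model and a single point are in scope; as far as I can tell, model.treeWeights stores exactly this weighting, 1.0 for the first tree and the learning rate for the rest):

    // Sketch: weighted sum of the per-tree leaf scores, then a logistic transform.
    // Assumes `model: GradientBoostedTreesModel` and `point: LabeledPoint` exist.
    val raw = model.trees.zip(model.treeWeights)
      .map { case (tree, weight) => weight * tree.predict(point.features) }
      .sum
    val prob = 1.0 / (1.0 + math.exp(-raw))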

I tested the same thing with 3 trees of depth 3, and it worked, also with various data sets.

It would be great to know if anyone else has tried this. If not, they can try it and comment.


In fact, the answers above are erroneous: the sigmoid function is wrong in this situation, because Spark transforms the labels into {-1, 1}. You should use this code:

    import com.github.fommil.netlib.BLAS.{getInstance => blas}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

    def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
      val treePredictions = gbdt.trees.map(_.predict(features))
      blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
    }

    val labelAndPreds = testData.map { point =>
      var prediction = score(point.features, model)
      // Labels are mapped to {-1, 1}, so the probability is 1 / (1 + exp(-2 F(x)))
      prediction = 1.0 / (1.0 + math.exp(-2.0 * prediction))
      (point.label, Vectors.dense(1.0 - prediction, prediction))
    }

More information can be found on page 9 of “Greedy Function Approximation: A Gradient Boosting Machine”, and in this pull request: https://github.com/apache/spark/pull/16441


Actually, @hbghhy is not right and @Run2 is right: Spark uses twice the binomial negative log likelihood as the loss, whereas Friedman uses the binomial negative log likelihood as the loss on page 9 of “Greedy Function Approximation”.

    /**
     * :: DeveloperApi ::
     * Class for log loss calculation (for classification).
     * This uses twice the binomial negative log likelihood, called "deviance" in Friedman (1999).
     *
     * The log loss is defined as:
     *   2 log(1 + exp(-2 y F(x)))
     * where y is a label in {-1, 1} and F(x) is the model prediction for features x.
     */
    @Since("1.2.0")
    @DeveloperApi
    object LogLoss extends ClassificationLoss {

      /**
       * Method to calculate the loss gradients for the gradient boosting calculation for binary
       * classification
       * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
       * @param prediction Predicted label.
       * @param label True label.
       * @return Loss gradient
       */
      @Since("1.2.0")
      override def gradient(prediction: Double, label: Double): Double = {
        -4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
      }

      override private[spark] def computeError(prediction: Double, label: Double): Double = {
        val margin = 2.0 * label * prediction
        // The following is equivalent to 2.0 * log(1 + exp(-margin)) but more numerically stable.
        2.0 * MLUtils.log1pExp(-margin)
      }
    }
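
Reading a probability off this loss: with labels in {-1, 1} and margin 2 y F(x), the implied positive-class probability is 1 / (1 + exp(-2 F(x))), which is what the scoring snippet above applies. A tiny sketch of that mapping (a hypothetical helper, not part of Spark's API):

    // Hypothetical helper: map a raw GBT score F(x) to P(y = 1 | x)
    // under the "twice the binomial negative log likelihood" loss quoted above.
    def probabilityFromMargin(fx: Double): Double = 1.0 / (1.0 + math.exp(-2.0 * fx))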
