Calculating the standard error of the estimates, Wald chi-squared statistics, and p-values for logistic regression in Spark

I tried to create a logistic regression model on sampled data.

The result of the model is a set of weights for the features used to build the model.

I could not find a Spark API for the standard error of the estimates, the Wald chi-squared statistics, p-values, etc.

I include my code below as an example:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("SparkTest").setMaster("local[*]"))

    // Parse each CSV line into a LabeledPoint: the first column is the label,
    // the remaining columns are the features
    val data: RDD[String] = sc.textFile("C:/Users/user/Documents/spark-1.5.1-bin-hadoop2.4/data/mllib/credit_approval_2_attr.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    // Split the data into training (70%) and test (30%) sets
    val splits: Array[RDD[LabeledPoint]] = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
    val training: RDD[LabeledPoint] = splits(0).cache()
    val test: RDD[LabeledPoint] = splits(1)

    // Run the training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)

    // Clear the prediction threshold so the model will return probabilities
    model.clearThreshold()
    print(model.weights)

The output weights of the model are:

 [-0.03335987643613915,0.025215092730373874,0.22617842810253946,0.29415985532104943,-0.0025559467210279694,4.5242237280512646E-4] 

It is just an array of weights.

I did, however, manage to calculate the accuracy, recall, precision, sensitivity, and other diagnostics of the model, as shown below.
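For reference, this is roughly how I compute those diagnostics, using the metrics classes imported above (the 0.5 threshold is my own choice, not anything Spark prescribes):

    // Restore a decision threshold so predict() returns hard 0/1 labels
    // again (clearThreshold() was called above)
    model.setThreshold(0.5)
    val predictionAndLabels = test.map(p => (model.predict(p.features), p.label))
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(s"Confusion matrix:\n${metrics.confusionMatrix}")
    println(s"Overall precision = ${metrics.precision}")
    println(s"Overall recall = ${metrics.recall}")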

Is there a way I can calculate the standard error of the estimates, the Wald chi-squared statistics, and the p-values in Spark?

I believe these are part of the standard output in R or SAS.

Does this depend on the optimization method used in Spark?

Here we use L-BFGS or SGD.
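To illustrate what I am after, here is my own back-of-the-envelope sketch (not a Spark API) of how the Wald statistics could be derived from the fitted weights via the observed Fisher information. It collects the training data to the driver and ignores any intercept term, so it only suits small data sets:

    import breeze.linalg.{diag, inv, DenseMatrix => BDM, DenseVector => BDV}
    import org.apache.commons.math3.distribution.ChiSquaredDistribution

    // Fitted coefficients (intercept ignored for brevity) and a local
    // copy of the design matrix -- only feasible for small data sets
    val beta = BDV(model.weights.toArray)
    val x = BDM(training.collect().map(_.features.toArray): _*)

    // Fitted probabilities and the IRLS weights p * (1 - p)
    val mu = (x * beta).map(z => 1.0 / (1.0 + math.exp(-z)))
    val w = mu.map(p => p * (1.0 - p))

    // Covariance of the coefficients = inverse of the observed Fisher
    // information X^T W X; standard errors are the roots of its diagonal
    val cov = inv(x.t * diag(w) * x)
    val chiSqDist = new ChiSquaredDistribution(1.0)
    for (j <- 0 until beta.length) {
      val se = math.sqrt(cov(j, j))
      val wald = math.pow(beta(j) / se, 2) // Wald chi-squared with 1 df
      val pValue = 1.0 - chiSqDist.cumulativeProbability(wald)
      println(f"coef $j: estimate = ${beta(j)}%.6f, se = $se%.6f, wald = $wald%.4f, p = $pValue%.4f")
    }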

Maybe I am simply not aware of the relevant evaluation API.

Any suggestion would be highly appreciated.

1 answer

The following method will give you detailed information about the chi-squared test:

    Statistics.chiSqTest(data)

Example input data:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.mllib.stat.test.ChiSqTestResult
    import org.apache.spark.rdd.RDD

    val obs: RDD[LabeledPoint] = sc.parallelize(
      Seq(
        LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
        LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
        LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))
      )
    )

    // Independence test of each feature against the label
    val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)

This returns an array of ChiSqTestResult, one per feature, each tested against the label.

Each result contains the test summary, including the p-value, degrees of freedom, test statistic, method used, and null hypothesis.
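A short usage sketch, printing the per-feature summaries (the toString of each ChiSqTestResult renders the full summary):

    // Print the summary (p-value, degrees of freedom, statistic,
    // method, null hypothesis) for each feature column
    featureTestResults.zipWithIndex.foreach { case (result, i) =>
      println(s"Column ${i + 1}:\n$result\n")
    }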
