Spark K-fold Cross Validation

I am having some problems understanding Spark's CrossValidator. Every example I have seen uses it to tune parameters, but I assumed it could also perform plain k-fold cross-validation.

What I want to do is k-fold cross-validation with k = 5. I want to get the accuracy for each fold and then the average accuracy. In scikit-learn this is how you do it: the call returns the score for each fold, and then you can use scores.mean():

 scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy') 
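For context, a sketch of what cross_val_score does internally: split the data into k folds, train on k-1 folds, score on the held-out fold, and average. This is plain Python with a toy MajorityClassifier stand-in (not a scikit-learn class), just to make the mechanics concrete:

```python
# Minimal sketch of k-fold cross-validation, assuming a classifier
# with fit(X, y) / predict(X). MajorityClassifier is a toy stand-in.

class MajorityClassifier:
    """Predicts the most common label seen during fit()."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label] * len(X)

def cross_val_accuracy(clf, X, y, cv=5):
    """Accuracy per fold, like cross_val_score(..., scoring='accuracy')."""
    n = len(X)
    fold_sizes = [n // cv + (1 if i < n % cv else 0) for i in range(cv)]
    scores, start = [], 0
    for size in fold_sizes:
        test_idx = set(range(start, start + size))
        train_idx = [i for i in range(n) if i not in test_idx]
        clf.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = clf.predict([X[i] for i in sorted(test_idx)])
        correct = sum(p == y[i] for p, i in zip(preds, sorted(test_idx)))
        scores.append(correct / size)
        start += size
    return scores

X = [[i] for i in range(10)]
y = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # 70% zeros
scores = cross_val_accuracy(MajorityClassifier(), X, y, cv=5)
mean_acc = sum(scores) / len(scores)  # same idea as scores.mean()
```

The per-fold scores are what avgMetrics summarizes on the Spark side.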

This is how I do it in Spark; the ParamGridBuilder is empty, since I don't want to tune any parameters.

 val paramGrid = new ParamGridBuilder().build()
 val evaluator = new MulticlassClassificationEvaluator()
 evaluator.setLabelCol("label")
 evaluator.setPredictionCol("prediction")
 evaluator.setMetricName("precision")
 val crossval = new CrossValidator()
 crossval.setEstimator(classifier)
 crossval.setEvaluator(evaluator)
 crossval.setEstimatorParamMaps(paramGrid)
 crossval.setNumFolds(5)
 val modelCV = crossval.fit(df4)
 val chk = modelCV.avgMetrics 

Does this do the same thing as the scikit-learn code? And why do the examples split the data into training/test sets before cross-validating?

How do I cross-validate a RandomForest model?

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala

1 answer
  • What you are doing looks correct.
  • Basically, yes, it works the same way as sklearn's CV grid search.
    For each entry in estimatorParamMaps (each parameter set), the algorithm is evaluated with CV, so avgMetrics holds the average cross-validation metric across all folds for each parameter set. If you use an empty ParamGridBuilder (no parameter search), you effectively get a "regular" cross-validation, and avgMetrics contains a single value: the cross-validated training metric.
  • Each CV iteration trains on K-1 folds and tests on the remaining fold. So why do most examples split the data into training/test sets before doing cross-validation? Because the test folds inside the CV are used to select the best entry in the parameter grid. That means the CV scores are biased upward by model selection, so a separate held-out "test set" is required to evaluate the final model. More here
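To make that last point concrete, here is a simplified sketch of the protocol (plain Python, toy data; the "model" is a hypothetical threshold classifier that needs no fitting, so each CV fold just scores it): hold out a test set first, use CV on the training portion to pick a parameter from the grid, then evaluate the chosen model once on the held-out test set:

```python
# Sketch: model selection via CV, then final evaluation on held-out data.
# The threshold is the hyperparameter being selected (a stand-in for a
# real param grid); real estimators would also be refit per fold.

def accuracy(threshold, data):
    """Classify x >= threshold as 1; accuracy on (x, label) pairs."""
    return sum((x >= threshold) == bool(label) for x, label in data) / len(data)

def cv_score(threshold, train, cv=5):
    """Average accuracy of `threshold` across cv folds of `train`."""
    n = len(train)
    scores = []
    for i in range(cv):
        fold = train[i * n // cv:(i + 1) * n // cv]
        scores.append(accuracy(threshold, fold))
    return sum(scores) / len(scores)

# Labeled points: label is 1 exactly when x >= 5.
data = [(x, int(x >= 5)) for x in range(20)]
train, test = data[:15], data[15:]   # hold out a test set FIRST

grid = [2, 5, 8]                     # the "param grid"
best = max(grid, key=lambda t: cv_score(t, train))  # CV selects a param

final_acc = accuracy(best, test)     # evaluate ONCE on held-out data
```

The CV scores on `train` picked `best`, so they are no longer an unbiased estimate; `final_acc` on the untouched `test` slice is.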