I am having some problems understanding Spark's CrossValidator. Every example I have seen uses it to tune parameters, but I assumed it would also perform the usual k-fold cross-validation.
What I want to do is k-fold cross-validation with k = 5. I want to get the accuracy for each fold and then the average accuracy. In scikit-learn this is how you do it; cross_val_score returns the score for each fold, and you can then call scores.mean():
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
print(scores, scores.mean())
This is how I do it in Spark. The ParamGridBuilder is empty, since I don't want to tune any parameters:
val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
evaluator.setLabelCol("label")
evaluator.setPredictionCol("prediction")
evaluator.setMetricName("accuracy") // "accuracy" in Spark 2.x+; the old "precision" name was removed

val crossval = new CrossValidator()
crossval.setEstimator(classifier)
crossval.setEvaluator(evaluator)
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(5)

val modelCV = crossval.fit(df4)
val chk = modelCV.avgMetrics
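To make the comparison concrete, this is my reading of avgMetrics (an assumption I would like confirmed, not something I found stated in the docs): with an empty grid there is exactly one ParamMap, so avgMetrics should hold a single entry, the metric averaged over the 5 folds, i.e. the analogue of scores.mean():

// Assumption: avgMetrics has one entry per ParamMap; with an empty grid
// that is a single value, the accuracy averaged over the 5 folds.
println(s"average accuracy over 5 folds: ${modelCV.avgMetrics(0)}")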
Does this do the same as the scikit-learn implementation? And why do the examples split the data into training and test sets before cross-validating?
How would I cross-validate a RandomForest model in this setup? A sketch of what I have in mind follows after the link below.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala
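In case it helps, here is a rough sketch of what I have in mind for the random forest, adapted from that example (assuming, as above, a DataFrame df4 with "label" and "features" columns; not tested):

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// 5-fold cross-validation of a RandomForestClassifier with no parameter tuning.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val rfEvaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val rfCrossval = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(rfEvaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build()) // empty grid: no tuning
  .setNumFolds(5)

val rfModelCV = rfCrossval.fit(df4)
println(s"random forest, average accuracy over 5 folds: ${rfModelCV.avgMetrics(0)}")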