Spark ML tuning with CrossValidator: access to metrics

To build a multiclass NaiveBayes classifier, I use CrossValidator to select the best options in my pipeline:

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new MulticlassClassificationEvaluator)
      .setNumFolds(10)

    val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF, and finally NaiveBayes.

Is it possible to access metrics calculated for the best model?

Ideally, I would like to access the metrics of all models, to see how changing the parameters changes the quality of the classification. For now, though, the metrics of the best model alone would be enough.

FYI, I am using Spark 1.6.0

2 answers

Here is how I do it:

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, stopWordsFilter, tf, idf, word2Vec, featureVectorAssembler,
        categoryIndexerModel, classifier, categoryReverseIndexer))
    ...
    val paramGrid = new ParamGridBuilder()
      .addGrid(tf.numFeatures, Array(10, 100))
      .addGrid(idf.minDocFreq, Array(1, 10))
      .addGrid(word2Vec.vectorSize, Array(200, 300))
      .addGrid(classifier.maxDepth, Array(3, 5))
      .build()

    paramGrid.size // 16 entries
    ...
    // Print the average metrics per ParamGrid entry
    val avgMetricsParamGrid = crossValidatorModel.avgMetrics

    // Combine with paramGrid to see how they affect the overall metrics
    val combined = paramGrid.zip(avgMetricsParamGrid)
    ...
    val bestModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]

    // Explain params for each stage
    val bestHashingTFNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].explainParams
    val bestIDFMinDocFrequency = bestModel.stages(3).asInstanceOf[IDFModel].explainParams
    val bestWord2VecVectorSize = bestModel.stages(4).asInstanceOf[Word2VecModel].explainParams
    val bestDecisionTreeDepth = bestModel.stages(7).asInstanceOf[DecisionTreeClassificationModel].explainParams
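To read the zipped result more easily, the pairs can be sorted so that the best parameter combination comes first. The following is only a minimal sketch in plain Scala: `RankGrid` and `rankByMetric` are hypothetical names, not part of Spark, and the strings and numbers in `main` are made-up stand-ins rather than real results.

```scala
// Hypothetical helper (not a Spark API): pairs each candidate with its metric
// and orders the pairs from best to worst.
object RankGrid {
  def rankByMetric[A](
      candidates: Seq[A],
      metrics: Seq[Double],
      largerIsBetter: Boolean): Seq[(A, Double)] = {
    // avgMetrics holds one value per param map, in the same order as the grid
    require(candidates.length == metrics.length, "expected one metric per candidate")
    val paired = candidates.zip(metrics)
    if (largerIsBetter) paired.sortBy { case (_, m) => -m }
    else paired.sortBy { case (_, m) => m }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in values for illustration only; in practice these would be
    // the param grid and crossValidatorModel.avgMetrics.
    val grid = Seq("numFeatures=10", "numFeatures=100", "numFeatures=1000")
    val metrics = Seq(0.61, 0.74, 0.70)
    rankByMetric(grid, metrics, largerIsBetter = true).foreach {
      case (params, metric) => println(f"$metric%.4f  $params")
    }
  }
}
```

With a fitted model this would presumably be called as `RankGrid.rankByMetric(paramGrid.toSeq, crossValidatorModel.avgMetrics.toSeq, crossValidatorModel.getEvaluator.isLargerBetter)`. The first pair then corresponds to the parameter set that CrossValidator selected for `bestModel`. Note that the value is the metric averaged over the folds (f1 by default for `MulticlassClassificationEvaluator`), not a metric of the model refit on the whole training set.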
The same attribute, cvModel.avgMetrics, works in PySpark 2.2.0.
