I am using PySpark 2.0 for a Kaggle competition. I would like to understand how the model (RandomForest) behaves depending on different parameters. ParamGridBuilder() lets you specify different values for individual parameters and then (I think) builds the Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = (ParamGridBuilder()
             .addGrid(rdc.maxDepth, [3, 10, 20])
             .addGrid(rdc.minInfoGain, [0.01, 0.001])
             .addGrid(rdc.numTrees, [5, 10, 20, 30])
             .build())
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
                             estimatorParamMaps=paramGrid,
                             evaluator=evaluator,
                             trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)
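If the grid really is a Cartesian product, paramGrid should just be a Python list with one param map per combination, so (assuming the grid above) a quick sanity check would be:

print(len(paramGrid))  # expecting 3 * 2 * 4 = 24 parameter combinations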
So now I can get some basic metrics manually:
def evaluate(result):
    predictionAndLabels = result.select("prediction", "label")
    metrics = ["f1", "weightedPrecision", "weightedRecall", "accuracy"]
    for m in metrics:
        evaluator = MulticlassClassificationEvaluator(metricName=m)
        print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))
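Calling it on the transformed DataFrame from above then prints one line per metric for the best model's predictions:

evaluate(result)  # prints "f1: ...", "weightedPrecision: ...", "weightedRecall: ...", "accuracy: ..."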
Now I want a few things:
- What are the parameters of the best model? This post partially answers the question: How to extract model hyperparameters from spark.ml in PySpark?
- What are the parameters of all models?
- What are the results (e.g. recall, accuracy, etc.) of each model? I found print(model.validationMetrics), which seems to display a list containing a metric for each model, but I cannot tell which model each value belongs to (see the sketch after this list for what I have in mind).
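From reading the TrainValidationSplitModel docs, I suspect validationMetrics is in the same order as the param maps in paramGrid, and that the fitted forest sits in the last stage of the best PipelineModel. A minimal sketch of what I have in mind (the stages[-1] indexing and the ordering assumption are mine, not something I have confirmed):

best_rf = model.bestModel.stages[-1]  # assuming the RandomForest is the last pipeline stage
print(best_rf.extractParamMap())      # may not show the tuned values in older PySpark, per the linked post

# pair each parameter combination with its validation metric (assuming matching order)
for params, metric in zip(paramGrid, model.validationMetrics):
    print({p.name: v for p, v in params.items()}, metric)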
If I can get all this information, I should be able to plot graphs and histograms and work just as I would with pandas and sklearn.