Pyspark - get all parameters of models created using ParamGridBuilder

I am using PySpark 2.0 for a Kaggle competition. I would like to know how the model (RandomForest) behaves depending on different parameters. ParamGridBuilder() lets you specify different values for individual parameters and then performs (I think) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

    rdc = RandomForestClassifier()
    pipeline = Pipeline(stages=STAGES + [rdc])
    paramGrid = (ParamGridBuilder()
                 .addGrid(rdc.maxDepth, [3, 10, 20])
                 .addGrid(rdc.minInfoGain, [0.01, 0.001])
                 .addGrid(rdc.numTrees, [5, 10, 20, 30])
                 .build())
    evaluator = MulticlassClassificationEvaluator()
    valid = TrainValidationSplit(estimator=pipeline,
                                 estimatorParamMaps=paramGrid,
                                 evaluator=evaluator,
                                 trainRatio=0.50)
    model = valid.fit(df)
    result = model.bestModel.transform(df)
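For reference, .build() returns a plain Python list of ParamMaps (dicts mapping each Param to a value), so the size of the Cartesian product can be inspected directly; a minimal sketch using the grid above:

    # paramGrid is a regular Python list of ParamMaps
    print(len(paramGrid))   # 3 * 2 * 4 = 24 parameter combinations
    print(paramGrid[0])     # one combination, a dict of Param -> value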

So far, I can get basic metrics with some manual work:

    def evaluate(result):
        predictionAndLabels = result.select("prediction", "label")
        metrics = ["f1", "weightedPrecision", "weightedRecall", "accuracy"]
        for m in metrics:
            evaluator = MulticlassClassificationEvaluator(metricName=m)
            print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))

Now I want a few things:

  • What are the parameters of the best model? This post partially answers the question: How to extract model hyperparameters from spark.ml in PySpark?
  • What are the parameters of all models?
  • What are the results (e.g. recall, accuracy, etc.) of each model? I found that print(model.validationMetrics) displays (it seems) a list containing the accuracy of each model, but I cannot figure out which model each value belongs to.

If I can get all this information, I should be able to display graphs and histograms and work just like with Pandas and sklearn.

2 answers

In short, you simply cannot get the parameters for all models because, just like CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection, not for exploration or experimentation.

What are the parameters of all models?

While you cannot retrieve the actual models, validationMetrics correspond to the input Params, so you can simply zip both:

    from typing import Dict, Tuple, List, Any
    from pyspark.ml.param import Param
    from pyspark.ml.tuning import TrainValidationSplitModel

    EvalParam = List[Tuple[float, Dict[Param, Any]]]

    def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
        return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))

to learn about the relationship between metrics and parameters.
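As a quick illustration (assuming model is the TrainValidationSplitModel fitted in the question), a hypothetical sketch that prints each metric next to the parameter values that produced it:

    for metric, param_map in get_metrics_and_params(model):
        # param_map is a dict of Param -> value; use .name for readable keys
        readable = {p.name: v for p, v in param_map.items()}
        print(metric, readable)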

If you need more information, you should use Pipeline Params directly. It will preserve all the models, which can be used for further processing:

 models = pipeline.fit(df, params=paramGrid) 

It will generate a list of PipelineModels matching the parameter maps in paramGrid:

    zip(models, paramGrid) 
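
Building on that, here is a minimal sketch (assuming the df, evaluator and paramGrid from the question, and the models list fitted above) that evaluates every fitted PipelineModel and keeps the metric next to its parameters:

    # Evaluate each fitted PipelineModel on df (you would normally use a
    # held-out validation DataFrame instead) and pair the score with its ParamMap.
    all_results = []
    for pipeline_model, param_map in zip(models, paramGrid):
        metric = evaluator.evaluate(pipeline_model.transform(df))
        readable_params = {p.name: v for p, v in param_map.items()}
        all_results.append((readable_params, metric))

    for readable_params, metric in all_results:
        print(readable_params, metric)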

I think I found a way to do this. I wrote a function that specifically pulls out the hyperparameters of a logistic regression (which has two tuned parameters) fitted with CrossValidator:

    import pandas as pd

    def hyperparameter_getter(model_obj, cv_fold=5.0):
        enet_list = []
        reg_list = []

        ## Get metrics
        metrics = model_obj.avgMetrics
        assert type(metrics) is list
        assert len(metrics) > 0

        ## Get the paramMap element holding the estimator param maps
        param_keys = list(model_obj._paramMap.keys())
        for x in range(len(param_keys)):
            if param_keys[x].name == 'estimatorParamMaps':
                param_map_key = param_keys[x]

        params = model_obj._paramMap[param_map_key]

        for i in range(len(params)):
            for k in params[i].keys():
                if k.name == 'elasticNetParam':
                    enet_list.append(params[i][k])
                if k.name == 'regParam':
                    reg_list.append(params[i][k])

        results_df = pd.DataFrame({'metrics': metrics,
                                   'elasticNetParam': enet_list,
                                   'regParam': reg_list})

        # Because of [SPARK-16831][PYTHON] the metrics are only summed across
        # folds, not averaged, in Spark < 2.1, so divide by the fold count.
        # 'sc' is assumed to be the active SparkContext.
        spark_version = [int(x) for x in sc.version.split('.')]
        if spark_version[0] <= 2 and spark_version[1] < 1:
            results_df.metrics = 1.0 * results_df['metrics'] / cv_fold

        return results_df
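A hypothetical usage sketch (assuming cvModel is a CrossValidatorModel fitted with numFolds=5 over a grid of regParam and elasticNetParam values, and sc is the active SparkContext):

    results_df = hyperparameter_getter(cvModel, cv_fold=5.0)
    # show the best parameter combinations first
    print(results_df.sort_values('metrics', ascending=False).head())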
