Pyspark - get all parameters of models created using ParamGridBuilder

I am using PySpark 2.0 for a Kaggle competition. I would like to know how the model (RandomForest) behaves depending on different parameters. ParamGridBuilder() lets you specify different values for individual parameters and then performs (I think) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

    rdc = RandomForestClassifier()
    pipeline = Pipeline(stages=STAGES + [rdc])
    paramGrid = (ParamGridBuilder()
                 .addGrid(rdc.maxDepth, [3, 10, 20])
                 .addGrid(rdc.minInfoGain, [0.01, 0.001])
                 .addGrid(rdc.numTrees, [5, 10, 20, 30])
                 .build())
    evaluator = MulticlassClassificationEvaluator()
    valid = TrainValidationSplit(estimator=pipeline,
                                 estimatorParamMaps=paramGrid,
                                 evaluator=evaluator,
                                 trainRatio=0.50)
    model = valid.fit(df)
    result = model.bestModel.transform(df)
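For reference, .build() returns a plain Python list of ParamMaps (dicts mapping each Param to a value), so the size of the Cartesian product can be inspected directly; a minimal sketch using the grid above:

    # paramGrid is a regular Python list of ParamMaps
    print(len(paramGrid))   # 3 * 2 * 4 = 24 parameter combinations
    print(paramGrid[0])     # one combination, a dict of Param -> value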

So far, I can get basic metrics with some manual work:

    def evaluate(result):
        predictionAndLabels = result.select("prediction", "label")
        metrics = ["f1", "weightedPrecision", "weightedRecall", "accuracy"]
        for m in metrics:
            evaluator = MulticlassClassificationEvaluator(metricName=m)
            print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))

Now I want a few things:

  • What are the parameters of the best model? This post partially answers the question: How to extract model hyperparameters from spark.ml in PySpark?
  • What are the parameters of all models?
  • What are the results (e.g. recall, accuracy, etc.) of each model? I found that print(model.validationMetrics) displays (it seems) a list containing the accuracy of each model, but I cannot figure out which model each value belongs to.

If I can get all this information, I should be able to display graphs and histograms and work just like with Pandas and sklearn.

2 answers

In short, you simply cannot get the parameters for all models because, just like CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection, not for exploration or experimentation.

What are the parameters of all models?

While you cannot retrieve the actual models, validationMetrics correspond to the input Params, so you can simply zip both:

    from typing import Dict, Tuple, List, Any
    from pyspark.ml.param import Param
    from pyspark.ml.tuning import TrainValidationSplitModel

    EvalParam = List[Tuple[float, Dict[Param, Any]]]

    def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
        return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))

to learn about the relationship between metrics and parameters.
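As a quick illustration (assuming model is the TrainValidationSplitModel fitted in the question), a hypothetical sketch that prints each metric next to the parameter values that produced it:

    for metric, param_map in get_metrics_and_params(model):
        # param_map is a dict of Param -> value; use .name for readable keys
        readable = {p.name: v for p, v in param_map.items()}
        print(metric, readable)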

If you need more information, you should use Pipeline Params directly. It will preserve all the models, which can be used for further processing:

 models = pipeline.fit(df, params=paramGrid) 

It will generate a list of PipelineModels matching the parameter maps in paramGrid:

    zip(models, paramGrid) 
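
Building on that, here is a minimal sketch (assuming the df, evaluator and paramGrid from the question, and the models list fitted above) that evaluates every fitted PipelineModel and keeps the metric next to its parameters:

    # Evaluate each fitted PipelineModel on df (you would normally use a
    # held-out validation DataFrame instead) and pair the score with its ParamMap.
    all_results = []
    for pipeline_model, param_map in zip(models, paramGrid):
        metric = evaluator.evaluate(pipeline_model.transform(df))
        readable_params = {p.name: v for p, v in param_map.items()}
        all_results.append((readable_params, metric))

    for readable_params, metric in all_results:
        print(readable_params, metric)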

I think I found a way to do this. I wrote a function that specifically pulls out the hyperparameters of a logistic regression (which has two tuned parameters) fitted with CrossValidator:

    import pandas as pd

    def hyperparameter_getter(model_obj, cv_fold=5.0):
        enet_list = []
        reg_list = []

        ## Get metrics
        metrics = model_obj.avgMetrics
        assert type(metrics) is list
        assert len(metrics) > 0

        ## Get the paramMap element holding the estimator param maps
        param_keys = list(model_obj._paramMap.keys())
        for x in range(len(param_keys)):
            if param_keys[x].name == 'estimatorParamMaps':
                param_map_key = param_keys[x]

        params = model_obj._paramMap[param_map_key]

        for i in range(len(params)):
            for k in params[i].keys():
                if k.name == 'elasticNetParam':
                    enet_list.append(params[i][k])
                if k.name == 'regParam':
                    reg_list.append(params[i][k])

        results_df = pd.DataFrame({'metrics': metrics,
                                   'elasticNetParam': enet_list,
                                   'regParam': reg_list})

        # Because of [SPARK-16831][PYTHON] the metrics are only summed across
        # folds, not averaged, in Spark < 2.1, so divide by the fold count.
        # 'sc' is assumed to be the active SparkContext.
        spark_version = [int(x) for x in sc.version.split('.')]
        if spark_version[0] <= 2 and spark_version[1] < 1:
            results_df.metrics = 1.0 * results_df['metrics'] / cv_fold

        return results_df
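A hypothetical usage sketch (assuming cvModel is a CrossValidatorModel fitted with numFolds=5 over a grid of regParam and elasticNetParam values, and sc is the active SparkContext):

    results_df = hyperparameter_getter(cvModel, cv_fold=5.0)
    # show the best parameter combinations first
    print(results_df.sort_values('metrics', ascending=False).head())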
