I'm using the h2o package (v3.6.0) in R, and I've built a grid search model. Now I'm trying to access the model that minimizes MSE on a validation set. In Python's sklearn this is easily achievable with RandomizedSearchCV:
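For reference, the sklearn pattern looks something like this (a minimal sketch; the regressor, dataset, and parameter grid are illustrative choices, not my actual code -- the point is the `best_estimator_` attribute):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# Randomized search over a toy parameter space, scored by MSE
# (sklearn maximizes scores, hence the negated metric).
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={"n_estimators": [25, 50]},
    n_iter=2,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

# The MSE-minimizing model is one attribute away.
best_model = search.best_estimator_
```

No digging through slots: the winning model, `search.best_score_`, and `search.best_params_` are all top-level attributes.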
Unfortunately, this is not nearly as easy in h2o. Here is a reproducible example:
library(h2o)
The grid view prints a ton of information, including this part:
> grid
ntrees distribution status_ok model_ids
    50    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
    25    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0
With a little digging, you can access each individual model and view all of its metrics:
> h2o.getModel(grid@model_ids[[1]])
H2OBinomialModel: gbm
Model ID:  Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1
Model Summary:
  number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1              50                4387         1         1    1.00000          2          2     2.00000

H2OBinomialMetrics: gbm
** Reported on training data. **

MSE:  1.056927e-05
R^2:  0.9999577
LogLoss:  0.003256338
AUC:  1
Gini:  1

Confusion Matrix for F1-optimal threshold:
           setosa versicolor    Error    Rate
setosa         50          0 0.000000   =0/50
versicolor      0         50 0.000000   =0/50
Totals         50         50 0.000000  =0/100

Maximum Metrics: Maximum metrics at their respective thresholds
                      metric threshold    value idx
1                     max f1  0.996749 1.000000   0
2                     max f2  0.996749 1.000000   0
3               max f0point5  0.996749 1.000000   0
4               max accuracy  0.996749 1.000000   0
5              max precision  0.996749 1.000000   0
6           max absolute_MCC  0.996749 1.000000   0
7 max min_per_class_accuracy  0.996749 1.000000   0
And with a lot of digging, you can finally get to this:
> h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
[1] 1.056927e-05
That seems like a lot of work to drill down to a metric that, for model selection, ought to sit at the top level (yes, I'm injecting my opinion here...). In my situation, I have a grid with hundreds of models, and my current hacked-together solution just doesn't feel very "R-esque":
# Loop over every model in the grid, keeping the one with the lowest training MSE.
model_select_ <- function(grid) {
  model_ids <- grid@model_ids
  min_mse <- Inf
  best_model <- NULL
  for (model_id in model_ids) {
    model <- h2o.getModel(model_id)
    mse <- model@model$training_metrics@metrics$MSE
    if (mse < min_mse) {
      min_mse <- mse
      best_model <- model
    }
  }
  best_model
}
This is so clunky for something so fundamental to the practice of machine learning, and it strikes me as strange that h2o wouldn't have a clean method for extracting the optimal model, or at least the model metrics.
Am I missing something? Is there an out-of-the-box method for selecting the best model?