H2O R API: getting the optimal model from the grid

I use the h2o package (v 3.6.0) in R, and I built a grid search model. Now I'm trying to access the model that minimizes MSE on a validation set. In Python's sklearn this is easily achieved with RandomizedSearchCV:

    ## Pseudo code:
    grid = RandomizedSearchCV(model, params, n_iter = 5)
    grid.fit(X)
    best = grid.best_estimator_

Unfortunately, this is not so easy to do in h2o. Here is a reproducible example:

    library(h2o)
    ## assume you have h2o initialized...

    X <- as.h2o(iris[1:100,])  # Note: only using top two classes for example

    grid <- h2o.grid(
      algorithm = 'gbm',
      x = names(X[,1:4]),
      y = 'Species',
      training_frame = X,
      hyper_params = list(
        distribution = 'bernoulli',
        ntrees = c(25, 50)
      )
    )

The grid view prints a ton of information, including this part:

    > grid
    ntrees distribution status_ok                                                                 model_ids
        50    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
        25    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0

With a little digging, you can access each individual model and view all the possible indicators:

    > h2o.getModel(grid@model_ids[[1]])
    H2OBinomialModel: gbm
    Model ID:  Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1
    Model Summary:
      number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
    1              50                4387         1         1    1.00000          2          2     2.00000

    H2OBinomialMetrics: gbm
    ** Reported on training data. **

    MSE:  1.056927e-05
    R^2:  0.9999577
    LogLoss:  0.003256338
    AUC:  1
    Gini:  1

    Confusion Matrix for F1-optimal threshold:
               setosa versicolor    Error    Rate
    setosa         50          0 0.000000  =0/50
    versicolor      0         50 0.000000  =0/50
    Totals         50         50 0.000000  =0/100

    Maximum Metrics: Maximum metrics at their respective thresholds
                            metric threshold    value idx
    1                       max f1  0.996749 1.000000   0
    2                       max f2  0.996749 1.000000   0
    3                 max f0point5  0.996749 1.000000   0
    4                 max accuracy  0.996749 1.000000   0
    5                max precision  0.996749 1.000000   0
    6             max absolute_MCC  0.996749 1.000000   0
    7   max min_per_class_accuracy  0.996749 1.000000   0

And with a lot of digging, you can finally get to this:

    > h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
    [1] 1.056927e-05

It seems like a lot of work to drill down to a metric that, in my opinion, should be exposed at the top level when choosing a model. In my situation I have a grid with hundreds of models, and my current hacked-together solution just doesn't seem very "R-esque":

    model_select_ <- function(grid) {
      model_ids <- grid@model_ids
      min <- Inf
      best_model <- NULL
      for (model_id in model_ids) {
        model <- h2o.getModel(model_id)
        mse <- model@model$training_metrics@metrics$MSE
        if (mse < min) {
          min <- mse
          best_model <- model
        }
      }
      best_model
    }

This feels far too clunky for something so fundamental to the practice of machine learning, and it strikes me as strange that h2o would not have a "clean" way to extract the optimal model, or at least the model metrics.

Am I missing something? Is there a method out of the box for choosing the best model?

+6
3 answers

Yes, there is an easy way to extract the "top" model from an H2O grid search. There are also utility functions (for example, h2o.mse) that retrieve the model metrics you were trying to access. Examples of how to do this can be found in the h2o-r/demos and h2o-py/demos folders of the h2o-3 GitHub repo.
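For instance, a minimal sketch (assuming the grid object from the question is still in the session) of how the metric accessor replaces the manual slot digging:

    ## A minimal sketch, assuming `grid` from the question is still in scope.
    ## h2o.mse() returns the same value as model@model$training_metrics@metrics$MSE.
    model <- h2o.getModel(grid@model_ids[[1]])
    h2o.mse(model)                  # training MSE
    # h2o.mse(model, valid = TRUE)  # validation MSE, if a validation_frame was supplied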

Since you are using R, here is a relevant code example that runs a grid search and returns sorted results. You can also find how to access this information in the R documentation for the h2o.getGrid function.

Print the AUC for all models, sorted by validation AUC:

    auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid",
                             sort_by = "auc",
                             decreasing = TRUE)
    print(auc_table)

Here is an example output:

    H2O Grid Details
    ================

    Grid ID: eeg_demo_gbm_grid
    Used hyper parameters:
      - ntrees
      - max_depth
      - learn_rate
    Number of models: 18
    Number of failed models: 0

    Hyper-Parameter Search Summary: ordered by decreasing auc
       ntrees max_depth learn_rate                  model_ids               auc
    1     100         5        0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
    2      50         5        0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
    3     100         5        0.1  eeg_demo_gbm_grid_model_8  0.94941792664595
    4      50         5        0.1  eeg_demo_gbm_grid_model_7 0.922075196552274
    5     100         3        0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
    6      50         3        0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
    7     100         3        0.1  eeg_demo_gbm_grid_model_5 0.884064379717198
    8       5         5        0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
    9      50         3        0.1  eeg_demo_gbm_grid_model_4 0.848921799270639
    10      5         5        0.1  eeg_demo_gbm_grid_model_6 0.825662907513139
    11    100         2        0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
    12     50         2        0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
    13    100         2        0.1  eeg_demo_gbm_grid_model_2  0.78299280750123
    14      5         3        0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
    15     50         2        0.1  eeg_demo_gbm_grid_model_1 0.754834657912535
    16      5         3        0.1  eeg_demo_gbm_grid_model_3 0.749285131682721
    17      5         2        0.2  eeg_demo_gbm_grid_model_9 0.692702793188135
    18      5         2        0.1  eeg_demo_gbm_grid_model_0 0.676144542037133

The top row of the table contains the model with the best AUC, so below we grab that model and extract its validation AUC:

    best_model <- h2o.getModel(auc_table@model_ids[[1]])
    h2o.auc(best_model, valid = TRUE)

For the h2o.getGrid results to be sortable by a metric on the validation set, you need to actually pass a validation_frame to the h2o.grid function. In your example above, you did not pass a validation_frame, so you cannot evaluate the models in the grid on a validation set.
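As a rough sketch of that (not from the original answer; the split ratio, seed, and grid_id are illustrative), reusing X from the question:

    ## Illustrative only: an 80/20 split so the grid can be sorted on validation metrics.
    splits <- h2o.splitFrame(X, ratios = 0.8, seed = 1)
    grid_v <- h2o.grid(
      algorithm = 'gbm',
      grid_id = 'gbm_grid_with_valid',
      x = names(X[,1:4]),
      y = 'Species',
      training_frame = splits[[1]],
      validation_frame = splits[[2]],  # required for validation-based sorting
      hyper_params = list(ntrees = c(25, 50))
    )
    sorted_grid <- h2o.getGrid('gbm_grid_with_valid', sort_by = 'auc', decreasing = TRUE)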

+4

This seems to be true only for the latest versions of h2o; with 3.8.2.3 you get a Java exception saying that "auc" is an invalid metric. The following fails:

    library(h2o)
    library(jsonlite)
    h2o.init()

    iris.hex <- as.h2o(iris)
    h2o.grid("gbm", grid_id = "gbm_grid_id", x = c(1:4), y = 5,
             training_frame = iris.hex,
             hyper_params = list(ntrees = c(1, 2, 3)))
    grid <- h2o.getGrid("gbm_grid_id", sort_by = "auc", decreasing = T)

However, replacing 'auc' with 'logloss' and decreasing = T with decreasing = F works fine.
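A minimal sketch of that working variant, assuming the same grid_id as above:

    ## Same grid as above, sorted ascending by logloss instead of descending by AUC.
    grid <- h2o.getGrid("gbm_grid_id", sort_by = "logloss", decreasing = FALSE)
    print(grid)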

+3

Unfortunately, the H2O grid function scores on the training_frame, not the validation_frame, when you pass both of them. Consequently, the winning model is extremely overfit and useless. EDIT: OK, one correction: it is actually useful to have a model with very low training error, for purposes of learning-curve analysis and bias vs. variance analysis. But to be clear, I also need to be able to re-run the search with the validation dataset used as the selection criterion for final model tuning and selection.

For example, here is the winning model from a GBM grid search where a validation_frame was passed and AUC was the sort metric. You can see that validation_auc starts at 0.5 and actually worsens to 0.44 over the scoring history of the winning model:

    Scoring History:
                  timestamp          duration number_of_trees training_rmse
    1 2017-02-06 10:09:19  6 min 13.153 sec               0       0.70436
    2 2017-02-06 10:09:23  6 min 16.863 sec             100       0.70392
    3 2017-02-06 10:09:27  6 min 20.950 sec             200       0.70343
    4 2017-02-06 10:09:31  6 min 24.806 sec             300       0.70289
    5 2017-02-06 10:09:35  6 min 29.244 sec             400       0.70232
    6 2017-02-06 10:09:39  6 min 33.069 sec             500       0.70171
    7 2017-02-06 10:09:43  6 min 37.243 sec             600       0.70107

      training_logloss training_auc training_lift training_classification_error
    1          2.77317      0.50000       1.00000                       0.49997
    2          2.69896      0.99980      99.42857                       0.00026
    3          2.62768      0.99980      99.42857                       0.00020
    4          2.55902      0.99982      99.42857                       0.00020
    5          2.49675      0.99993      99.42857                       0.00020
    6          2.43712      0.99994      99.42857                       0.00020
    7          2.38071      0.99994      99.42857                       0.00013

      validation_rmse validation_logloss validation_auc validation_lift
    1         0.06921            0.03058        0.50000         1.00000
    2         0.06921            0.03068        0.45944         9.03557
    3         0.06922            0.03085        0.46685         9.03557
    4         0.06922            0.03107        0.46817         9.03557
    5         0.06923            0.03133        0.45656         9.03557
    6         0.06924            0.03163        0.44947         9.03557
    7         0.06924            0.03192        0.44400         9.03557

      validation_classification_error
    1                         0.99519
    2                         0.00437
    3                         0.00656
    4                         0.00656
    5                         0.00700
    6                         0.00962
    7                         0.00962
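To check this on your own grid, a sketch along these lines (the grid_id is hypothetical) pulls the winning model's scoring history and its final validation AUC:

    ## Hypothetical grid_id; assumes the grid was trained with a validation_frame.
    sorted_grid <- h2o.getGrid("my_gbm_grid", sort_by = "auc", decreasing = TRUE)
    winner <- h2o.getModel(sorted_grid@model_ids[[1]])
    h2o.scoreHistory(winner)       # per-iteration training and validation metrics
    h2o.auc(winner, valid = TRUE)  # final validation AUC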
-2
