What measure does an H2O random forest use for variable "importance"?

Here is my code:

    set.seed(1)

    # Boruta on the HouseVotes84 data from mlbench
    library(mlbench)  # has HouseVotes84 data
    library(h2o)      # has rf

    # spin up h2o
    myh20 <- h2o.init(nthreads = -1)

    # read in data, throw some away
    data(HouseVotes84)
    hvo <- na.omit(HouseVotes84)

    # move from R to h2o
    mydata <- as.h2o(x = hvo, destination_frame = "mydata")

    # RF columns (input vs. output)
    idxy <- 1
    idxx <- 2:ncol(hvo)

    # split data
    splits <- h2o.splitFrame(mydata, c(0.8, 0.1))
    train <- h2o.assign(splits[[1]], key = "train")
    valid <- h2o.assign(splits[[2]], key = "valid")

    # make random forest
    my_imp.rf <- h2o.randomForest(y = idxy, x = idxx,
                                  training_frame = train,
                                  validation_frame = valid,
                                  model_id = "my_imp.rf",
                                  ntrees = 200)

    # find importance
    my_varimp <- h2o.varimp(my_imp.rf)
    my_varimp

The result I get is labeled only "variable importance."

The classical measures are "mean decrease in accuracy" and "mean decrease in Gini".

My results:

    > my_varimp
    Variable Importances:
       variable relative_importance scaled_importance percentage
    1        V4         3255.193604          1.000000   0.410574
    2        V5         1131.646484          0.347643   0.142733
    3        V3          921.106567          0.282965   0.116178
    4       V12          759.443176          0.233302   0.095788
    5       V14          492.264954          0.151224   0.062089
    6        V8          342.811554          0.105312   0.043238
    7       V11          205.392654          0.063097   0.025906
    8        V9          191.110046          0.058709   0.024105
    9        V7          169.117676          0.051953   0.021331
    10      V15          135.097076          0.041502   0.017040
    11      V13          114.906586          0.035299   0.014493
    12       V2           51.939777          0.015956   0.006551
    13      V10           46.716656          0.014351   0.005892
    14       V6           44.336708          0.013620   0.005592
    15      V16           34.779987          0.010684   0.004387
    16       V1           32.528778          0.009993   0.004103

So the relative importance of vote no. 4 (V4) is about 3255.2.

Questions: What units is that number in? How is it calculated?

I looked through the documentation and the help pages, and I used Flow to inspect the model parameters. None of them mention "Gini" or "decrease in accuracy." Where should I look?

1 answer

The answer is in the docs.

[In the left pane, click "Algorithms", then "Supervised", then "DRF". The FAQ section answers this question.]

For convenience, the answer is also copied and pasted here:

"How is the value of a variable calculated for DRF? The value of a variable is determined by calculating the relative influence of each variable: whether this variable was selected during the division during the tree construction and how much the square error (for all trees) is improved, the result."
