Understanding the tree structure in the R gbm package

I am having difficulty understanding how trees are structured in the R gbm (gradient boosting machine) package. In particular, looking at the output of pretty.gbm.tree, what do the indices in SplitVar point to?

I trained a GBM on a data set. Here is the top quarter of one of my trees, as returned by pretty.gbm.tree:

      SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
    0        9  6.250000e+01        1         2          21      0.6634681   5981  0.005000061
    1       -1  1.895699e-12       -1        -1          -1      0.0000000   3013  0.018956988
    2       31  4.462500e+02        3         4          20      1.0083722   2968 -0.009168477
    3       -1  1.388483e-22       -1        -1          -1      0.0000000   1430  0.013884830
    4       38  5.500000e+00        5        18          19      1.5748155   1538 -0.030602956
    5       24  7.530000e+03        6        13          17      2.8329899    361 -0.078738904
    6       41  2.750000e+01        7        11          12      2.2499063    334 -0.064752766
    7       28 -3.155000e+02        8         9          10      1.5516610     57 -0.243675567
    8       -1 -3.379312e-11       -1        -1          -1      0.0000000     45 -0.337931219
    9       -1  1.922333e-10       -1        -1          -1      0.0000000     12  0.109783128

It seems to me that the node indices are 0-based, judging by how LeftNode, RightNode and MissingNode point to different rows. However, when I test this by following data samples down the tree until they reach a prediction, I get the correct answer only if I assume that SplitVar uses 1-based indexing.

Yet one of the many trees I built has a zero in the SplitVar column! Here is that tree:

      SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
    0        4  1.462500e+02        1         2          21        0.41887   5981  0.0021651262
    1       -1  4.117688e-22       -1        -1          -1        0.00000    512  0.0411768781
    2        4  1.472500e+02        3         4          20        1.05222   5469 -0.0014870985
    3       -1 -2.062798e-11       -1        -1          -1        0.00000     23 -0.2062797579
    4        0  4.750000e+00        5         6          19        0.65424   5446 -0.0006222011
    5       -1  3.564879e-23       -1        -1          -1        0.00000   4897  0.0035648788
    6       28 -3.195000e+02        7        11          18        1.39452    549 -0.0379703437

What is the correct way to view the indexing used by gbm trees?

1 answer

The first column printed by pretty.gbm.tree is row.names, which is assigned in the pretty.gbm.tree.R script. In that script the row names are set with row.names(temp) <- 0:(nrow(temp)-1), where temp is the tree information stored as a data.frame. The correct way to interpret row.names is as a node id, with the root node set to 0.
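As a small sketch of what that numbering means in practice, the snippet below applies the same row.names assignment to a made-up three-row data frame (the column values are hypothetical, not taken from a fitted model). Because the row names start at 0 while R's positional indexing starts at 1, node id k sits in row k + 1, or it can be looked up by its row name as a character string.

```r
# Hypothetical stand-in for the data frame returned by pretty.gbm.tree.
temp <- data.frame(SplitVar  = c(2, -1, -1),
                   LeftNode  = c(1, -1, -1),
                   RightNode = c(2, -1, -1))
row.names(temp) <- 0:(nrow(temp) - 1)   # same assignment as quoted above

node_id <- 2
by_position <- temp[node_id + 1, ]             # positional lookup: shift by one
by_name     <- temp[as.character(node_id), ]   # lookup by row name: no shift

print(by_position)
print(by_name)
```

Both lookups select the same row, so either convention works as long as it is applied consistently.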

In your example:

    Id SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight  Prediction
     0        9  6.250000e+01        1         2          21      0.6634681   5981 0.005000061

means that the root node (row name 0) is split on split variable 9. SplitVar is numbered from 0 here, so this split variable is the 10th column of the training set x. That is also why a SplitVar of 0, as in your second tree, is valid: it refers to the first column. The SplitCodePred of 6.250000e+01 (that is, 62.5) means that all points with a value less than 62.5 went to LeftNode 1, and all points with a value greater than 62.5 went to RightNode 2. All points with a missing value in this column were sent to MissingNode 21. This split reduced the error by 0.6634 (ErrorReduction), and there were 5981 observations at the root node (Weight). The Prediction of 0.005 is the value assigned to all points in this node before the split. For terminal nodes (leaves), indicated by -1 in SplitVar, LeftNode, RightNode and MissingNode, Prediction is the value predicted for all points belonging to that leaf, adjusted by (multiplied by) the shrinkage.
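The reading above can be turned into a small traversal routine. This is a minimal sketch under stated assumptions, not gbm's own prediction code: find_leaf is a hypothetical helper, the tree is a made-up example in the pretty.gbm.tree column layout, only continuous splits are handled, and a value exactly equal to the split point is assumed to go right.

```r
# Hypothetical tree in the pretty.gbm.tree column layout (all values made up).
tree <- data.frame(
  SplitVar      = c(1, -1, 0, -1, -1, -1, -1),
  SplitCodePred = c(62.5, 0.02, 5.5, -0.01, 0.03, 0.00, 0.01),
  LeftNode      = c(1, -1, 3, -1, -1, -1, -1),
  RightNode     = c(2, -1, 4, -1, -1, -1, -1),
  MissingNode   = c(6, -1, 5, -1, -1, -1, -1)
)
row.names(tree) <- 0:(nrow(tree) - 1)

# Hypothetical helper: returns the node id of the leaf that observation x reaches.
find_leaf <- function(tree, x) {
  node <- 0                                  # the root has node id 0
  repeat {
    row <- tree[node + 1, ]                  # node id k is stored in row k + 1
    if (row$SplitVar == -1) return(node)     # -1 marks a terminal node
    value <- x[row$SplitVar + 1]             # SplitVar is 0-based, R vectors are 1-based
    node <- if (is.na(value)) {
      row$MissingNode                        # missing values have their own branch
    } else if (value < row$SplitCodePred) {
      row$LeftNode
    } else {
      row$RightNode                          # assumption: ties go right
    }
  }
}

leaf_a <- find_leaf(tree, c(3.0, 70))   # column 2 is not below 62.5, column 1 is below 5.5
leaf_b <- find_leaf(tree, c(NA, 70))    # column 1 is missing at the second split
leaf_c <- find_leaf(tree, c(3.0, 10))   # column 2 is below 62.5
print(c(leaf_a, leaf_b, leaf_c))
```

The two offsets are independent: the + 1 on node converts a node id to a row position, and the + 1 on SplitVar converts a 0-based column index to an R index.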

To understand the structure of the tree, it is important to note that the tree is split in a depth-first manner. When the root node (node id 0) is split into its left and right nodes, the left side is processed until no further splits are possible, and only then does the numbering return to the right node. In both trees in your example, the RightNode of the root gets the value 2. This is because in both cases the LeftNode turns out to be a leaf node.
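The depth-first numbering can be checked against the first tree from the question. The sketch below copies only the integer link columns of its first ten rows and tests two consequences of that numbering: the left child always has the id right after its parent, and, assuming the left subtree is numbered completely before the right child, RightNode - LeftNode is the number of nodes in the left subtree.

```r
# Link columns copied from the first ten rows of the first tree in the question.
tree <- data.frame(
  SplitVar    = c(9, -1, 31, -1, 38, 24, 41, 28, -1, -1),
  LeftNode    = c(1, -1, 3, -1, 5, 6, 7, 8, -1, -1),
  RightNode   = c(2, -1, 4, -1, 18, 13, 11, 9, -1, -1),
  MissingNode = c(21, -1, 20, -1, 19, 17, 12, 10, -1, -1)
)
row.names(tree) <- 0:(nrow(tree) - 1)

node_id  <- as.integer(row.names(tree))
internal <- tree$SplitVar != -1              # rows that are actual splits

# Depth-first: the left child is numbered immediately after its parent.
left_follows_parent <- all(tree$LeftNode[internal] == node_id[internal] + 1)

# Size of each left subtree; a size of 1 means the left child is a leaf.
left_size <- tree$RightNode[internal] - tree$LeftNode[internal]

print(left_follows_parent)
print(data.frame(node = node_id[internal], left_subtree_size = left_size))
```

For the root the left subtree has size 1, which is the numeric form of the observation that RightNode is 2 because LeftNode is a leaf.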
