How is xgboost Cover calculated?

Can someone explain how the Cover column in the xgboost R package (in the xgb.model.dt.tree output) is calculated?

The documentation says Cover "is a metric for measuring the number of observations affected by a split."

When you run the following code given in the xgboost documentation for this function, Cover for node 0 of tree 0 is 1628.2500.

    library(xgboost)

    data(agaricus.train, package = 'xgboost')
    # Both datasets are lists with two items: a sparse matrix and labels
    # (labels = the outcome column to be learned).
    # Each column of the sparse matrix is a feature in one-hot encoding format.
    train <- agaricus.train
    bst <- xgboost(data = train$data, label = train$label,
                   max.depth = 2, eta = 1, nthread = 2, nround = 2,
                   objective = "binary:logistic")
    # agaricus.train$data@Dimnames[[2]] holds the column names of the sparse matrix.
    xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)

There are 6513 observations in the train dataset, so can anyone explain why Cover for node 0 of tree 0 is a quarter of that number (1628.25)?

In addition, how is the Cover for node 1 of tree 1 (788.852) calculated?

Any help would be greatly appreciated. Thanks.

1 answer

Cover is defined in the xgboost documentation as:

the sum of second order gradient of training data classified to the leaf; if it is square loss, this simply corresponds to the number of instances in that branch. The deeper in the tree a node is, the lower this metric will be.

https://github.com/dmlc/xgboost/blob/f5659e17d5200bd7471a2e735177a81cb8d3012b/R-package/man/xgb.plot.tree.Rd (not particularly well documented...)

To calculate Cover, we need to know the predictions at that point in the tree and the second derivative of the loss function with respect to the raw prediction.

Fortunately for us, the prediction for each of the 6513 data points at node 0-0 in your example is 0.5, because the global default sets the initial prediction at t = 0 to 0.5:

base_score [default=0.5]: the initial prediction score of all instances, global bias

http://xgboost.readthedocs.org/en/latest/parameter.html

The gradient of the binary logistic loss (which is your objective function) is p - y, where p is your prediction and y is the true label.

So the hessian (which is what we need here) is p * (1 - p). Note that the hessian does not depend on y, the true labels.
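
For completeness, here is that derivation written out (my notation, not the package's: x is the raw margin score output by the ensemble and p = sigma(x) is the predicted probability):

    \[
    L(y, x) = -\bigl[\, y \log p + (1 - y)\log(1 - p) \,\bigr],
    \qquad p = \sigma(x) = \frac{1}{1 + e^{-x}}
    \]
    \[
    \frac{\partial L}{\partial x} = p - y,
    \qquad
    \frac{\partial^2 L}{\partial x^2} = p\,(1 - p)
    \]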

So, bringing it home:

6513 * 0.5 * (1 - 0.5) = 1628.25
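
You can sanity-check this number against the model itself. A minimal sketch, continuing from the question's code above (so train and bst already exist); I'm assuming the default base_score of 0.5, and the exact column layout of the xgb.model.dt.tree table may differ slightly between xgboost versions:

    # the prediction before any trees are built is base_score = 0.5 for every row
    n  <- nrow(train$data)   # 6513 training rows
    p0 <- 0.5
    n * p0 * (1 - p0)        # [1] 1628.25

    # the first row of the tree table is the root of tree 0; its Cover should match
    dt <- xgb.model.dt.tree(model = bst)
    head(dt)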

For the second tree, the predictions at that point are no longer all 0.5, so let's get the predictions after one tree:

    # ntreelimit = 1 restricts prediction to the first tree
    # (newer xgboost versions use iterationrange instead)
    p <- predict(bst, newdata = train$data, ntreelimit = 1)
    head(p)
    # [1] 0.8471184 0.1544077 0.1544077 0.8471184 0.1255700 0.1544077
    sum(p * (1 - p))  # sum of the hessians in that node (the root node sees all the data)
    # [1] 788.8521
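
To see that this matches what the model reports, you can pull the rows of the tree table for the second tree (again a sketch; the Tree and Cover column names are the ones xgb.model.dt.tree produces, but the layout can vary a little by version):

    dt <- xgb.model.dt.tree(model = bst)
    dt[dt$Tree == 1, ]   # the root row of the second tree should show Cover of about 788.85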

Note that for linear (squared error) regression the hessian is the same constant for every example, so Cover simply shows how many examples are in that leaf.
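
Here is a minimal sketch illustrating that point with a regression objective. The toy data is made up, and I'm assuming the "reg:squarederror" objective (older xgboost releases call it "reg:linear"):

    library(xgboost)

    # toy regression data (made up for illustration)
    set.seed(42)
    X <- matrix(rnorm(200 * 3), ncol = 3)
    y <- X[, 1] + rnorm(200)

    reg <- xgboost(data = X, label = y, max.depth = 2, eta = 1,
                   nrounds = 1, objective = "reg:squarederror", verbose = 0)

    # With squared error the hessian is the same constant for every row,
    # so the root's Cover equals the number of training rows (200 here).
    xgb.model.dt.tree(model = reg)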

The big takeaway is that Cover is determined by the hessian of the objective function. There is plenty of information available on how to derive the gradient and hessian of the binary logistic loss.

These slides are also helpful for seeing why xgboost uses the hessian as a weight, and they explain how xgboost differs from standard trees: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
