XGBoost: handling missing values during split candidate search

In Section 3.4 of their paper, the authors explain how they handle missing values when searching for the best candidate split during tree growth. In particular, they create a default direction for those nodes whose split feature has missing values in the current instance set. At prediction time, if the prediction path passes through such a node and the feature value is missing, the default direction is followed.
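To make the prediction-time behaviour concrete, here is a minimal sketch in base R. It is not xgboost's implementation: the node layout and the helpers `leaf`, `split_node` and `predict_one` are hypothetical names invented for illustration. It only assumes what the paragraph above states, namely that a split node carries a default child that is used when the feature value is missing.

```r
# Hypothetical node layout (an assumption, not xgboost's internal format):
# a split node stores the feature name, a threshold, both children, and a
# default child ("yes" or "no") that is used when the feature value is NA.
leaf <- function(value) list(is_leaf = TRUE, value = value)

split_node <- function(feature, threshold, yes, no, default = "yes") {
  list(is_leaf = FALSE, feature = feature, threshold = threshold,
       yes = yes, no = no, default = default)
}

# Walk one instance (a named numeric vector) down the tree.
predict_one <- function(node, x) {
  while (!node$is_leaf) {
    v <- x[[node$feature]]
    if (is.na(v)) {
      branch <- node$default        # missing value: follow the default direction
    } else if (v < node$threshold) {
      branch <- "yes"               # split condition holds
    } else {
      branch <- "no"
    }
    node <- node[[branch]]
  }
  node$value
}

tree <- split_node("f1", 0.5,
                   yes = leaf(-1),
                   no  = split_node("f2", 2,
                                    yes = leaf(0.3), no = leaf(0.9),
                                    default = "no"))

p_missing_f1 <- predict_one(tree, c(f1 = NA,  f2 = 1))   # f1 is NA: default "yes" child
p_missing_f2 <- predict_one(tree, c(f1 = 0.7, f2 = NA))  # f2 is NA: default "no" child
p_complete   <- predict_one(tree, c(f1 = 0.7, f2 = 1))   # no missing values
```

In this sketch every split node has a default, so the traversal never gets stuck on a missing value; whether that holds for every node in a real model is exactly what the question asks.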

However, it seems the prediction phase would break whenever a feature value is missing and the node has no default direction (and this can happen in many scenarios). In other words, how do they assign a default direction to every node, including nodes whose split feature has no missing values in the active instance set during training?

3 answers

xgboost always assigns a split direction for missing values, even if none are present in training. The default is the `yes` direction of the split criterion. If any missing values are present in training, the direction is learned from them instead.
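The idea can be sketched in base R for a single feature. This is a simplified illustration, not xgboost's code: `score` and `best_split` are hypothetical helpers, the gain uses only the G^2 / (H + lambda) terms with constants and the complexity penalty dropped, and the tie-breaking rule (keep `yes` when both directions give the same gain) is an assumption chosen to mirror the behaviour described above.

```r
# Structure score of a group of instances with gradient sum G and hessian
# sum H (constants and the gamma penalty are omitted in this sketch).
score <- function(G, H, lambda = 1) G^2 / (H + lambda)

# For one feature, try every threshold with the missing rows sent to the
# "yes" child and then to the "no" child; keep the best combination.
best_split <- function(x, g, h, lambda = 1) {
  miss <- is.na(x)
  G <- sum(g); H <- sum(h)
  G_miss <- sum(g[miss]); H_miss <- sum(h[miss])
  best <- list(gain = -Inf, threshold = NA, default = NA)
  for (thr in sort(unique(x[!miss]))) {
    left <- !miss & x < thr
    GL <- sum(g[left]); HL <- sum(h[left])
    for (dir in c("yes", "no")) {
      GLd <- if (dir == "yes") GL + G_miss else GL
      HLd <- if (dir == "yes") HL + H_miss else HL
      gain <- score(GLd, HLd, lambda) + score(G - GLd, H - HLd, lambda) -
        score(G, H, lambda)
      # Strict ">" means "yes" wins ties, e.g. when there are no missing rows.
      if (gain > best$gain) {
        best <- list(gain = gain, threshold = thr, default = dir)
      }
    }
  }
  best
}

# Squared-error gradients at an initial prediction of 0: g = -y, h = 1.
y1 <- c(0, 0, 1, 1, 1, 1)
with_na <- best_split(c(1, 2, 3, 4, NA, NA), g = -y1, h = rep(1, 6))

y2 <- c(0, 0, 1, 1)
no_na <- best_split(c(1, 2, 3, 4), g = -y2, h = rep(1, 4))

with_na$default  # "no": the missing rows behave like the large-x rows
no_na$default    # "yes": no missing rows, so the tie keeps the first direction
```

With no missing rows the two directions always produce the same gain, so a default is still recorded for the node; it simply carries no information learned from the data.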

Link from the author


This can be observed with the following code:

    library(xgboost)

    data(agaricus.train, package = 'xgboost')

    # The training data contains no missing values
    sum(is.na(agaricus.train$data))
    ##[1] 0

    bst <- xgboost(data = agaricus.train$data,
                   label = agaricus.train$label,
                   max.depth = 4,
                   eta = .01,
                   nrounds = 100,
                   nthread = 2,
                   objective = "binary:logistic")

    dt <- xgb.model.dt.tree(model = bst)  ## records all the splits

    > head(dt)
         ID Feature        Split  Yes   No Missing      Quality   Cover Tree Yes.Feature Yes.Cover  Yes.Quality
    1:  0-0      28 -1.00136e-05  0-1  0-2     0-1 4000.5300000 1628.25    0          55    924.50 1158.2100000
    2:  0-1      55 -1.00136e-05  0-3  0-4     0-3 1158.2100000  924.50    0           7    679.75   13.9060000
    3: 0-10    Leaf           NA   NA   NA      NA   -0.0198104  104.50    0          NA        NA           NA
    4: 0-11       7 -1.00136e-05 0-15 0-16    0-15   13.9060000  679.75    0        Leaf    763.00    0.0195026
    5: 0-12      38 -1.00136e-05 0-17 0-18    0-17   28.7763000   10.75    0        Leaf    678.75   -0.0199117
    6: 0-13    Leaf           NA   NA   NA      NA    0.0195026  763.00    0          NA        NA           NA
       No.Feature No.Cover No.Quality
    1:       Leaf   104.50 -0.0198104
    2:         38    10.75 28.7763000
    3:         NA       NA         NA
    4:       Leaf     9.50 -0.0180952
    5:       Leaf     1.00  0.0100000
    6:         NA       NA         NA

    > all(dt$Missing == dt$Yes,na.rm = T)
    [1] TRUE

https://github.com/tqchen/xgboost/blob/8130778742cbdfa406b62de85b0c4e80b9788821/src/tree/model.h#L542


, , . IE. . , . , , - , , GBM. , NA / , K , .

, , . , -, , . , , , . . , , - . , . . , , , , .


, @Josiah. , , , , . , , GBM, , . , , - (< 10%) .

, , , : , . node f, , , . , f , node . , , .

, , , . , .
