Understanding the minbucket parameter in a CART model using R

Suppose I have the following "fruit" training data, which I will use to fit a CART model in R:

> library(rpart)
> library(rpart.plot)  # provides prp()
> fruit = data.frame(
+   color = c("red", "red", "red", "yellow", "red", "yellow", "orange", "green", "pink", "red", "red"),
+   isApple = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE))
> mod = rpart(isApple ~ color, data = fruit, method = "class", minbucket = 1)
> prp(mod)

Can someone explain the role of minbucket in building the CART tree for this example when minbucket = 2, 3, 4, or 5?

I have two variables, color and isApple. The color variable takes the values green, yellow, pink, orange, and red. The isApple variable is either TRUE or FALSE. In the example above, red has five TRUE and one FALSE mapped to it, so the red value appears six times. With minbucket = 1, 2, or 3 the tree splits. With minbucket = 4 or 5 no split occurs, even though red appears six times.

1 answer

From the documentation for the rpart package:

minbucket

the minimum number of observations in any terminal (leaf) node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket * 3 or minbucket to minsplit / 3, as appropriate.
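The default rule quoted above can be checked by calling rpart.control() directly, which returns the resolved list of control settings. This is only a small sketch: it assumes the rpart package is installed, and the loop values simply mirror the ones asked about in the question.

```r
library(rpart)

# When only minbucket is given, rpart.control() derives minsplit = minbucket * 3.
for (mb in 1:5) {
  ctrl <- rpart.control(minbucket = mb)
  cat("minbucket =", ctrl$minbucket, " minsplit =", ctrl$minsplit, "\n")
}

# When only minsplit is given, minbucket defaults to round(minsplit / 3).
ctrl <- rpart.control(minsplit = 12)
cat("minsplit =", ctrl$minsplit, " minbucket =", ctrl$minbucket, "\n")
```

For the 11-row fruit data in the question, minbucket = 4 implies minsplit = 12, which is more than the number of rows, so a split of the root node is never attempted.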

Setting minbucket to 1 imposes no real constraint, since each leaf node will (by definition) contain at least one observation. If you set it to a higher value, say 3, then every leaf node must contain at least 3 observations.
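To see how this plays out on the fruit data from the question, one option is to refit the tree for each minbucket value and count its leaves. The sketch below assumes the rpart package is installed; count_leaves is a hypothetical helper written for this illustration, not part of rpart, and stringsAsFactors = TRUE is set only to make the categorical treatment of color explicit.

```r
library(rpart)

fruit <- data.frame(
  color = c("red", "red", "red", "yellow", "red", "yellow",
            "orange", "green", "pink", "red", "red"),
  isApple = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
              FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = TRUE
)

# Hypothetical helper: fit the tree with the given control arguments
# and return the number of terminal (leaf) nodes.
count_leaves <- function(...) {
  mod <- rpart(isApple ~ color, data = fruit, method = "class", ...)
  sum(mod$frame$var == "<leaf>")
}

# Only minbucket is supplied, so minsplit is derived as minbucket * 3.
for (mb in 1:5) {
  cat("minbucket =", mb, " leaves =", count_leaves(minbucket = mb), "\n")
}

# Supplying minsplit explicitly stops rpart from deriving it from minbucket.
cat("minbucket = 5, minsplit = 10: leaves =",
    count_leaves(minbucket = 5, minsplit = 10), "\n")
```

On this data the loop should report two leaves for minbucket = 1 to 3 and a single leaf (no split) for 4 and 5, matching what the question describes. The final call splits again because minsplit = 10 no longer exceeds the 11 rows, and both candidate leaves (six red rows, five non-red rows) hold at least five observations. So the missing split at minbucket = 4 or 5 comes from the derived minsplit, not from the leaf sizes themselves.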

The lower the minbucket value, the more closely your CART model can fit the training data. Setting minbucket to a value that is too small, such as 1, runs the risk of overfitting your model.
