Why are the results from caret::train(..., method = "rpart") different from rpart::rpart(...)?

I'm taking a Coursera machine learning course, and the coursework requires building predictive models on this dataset. After splitting the data into training and testing sets on the outcome of interest (labeled y below, though it is actually the classe variable in the dataset):

    inTrain <- createDataPartition(y = data$y, p = 0.75, list = FALSE)
    training <- data[inTrain, ]
    testing <- data[-inTrain, ]

I tried 2 different methods:

    modFit <- caret::train(y ~ ., method = "rpart", data = training)
    pred <- predict(modFit, newdata = testing)
    confusionMatrix(pred, testing$y)

versus:

    modFit <- rpart::rpart(y ~ ., data = training)
    pred <- predict(modFit, newdata = testing, type = "class")
    confusionMatrix(pred, testing$y)

I would have expected these to give the same or very similar results, since the first method loads the rpart package (and presumably uses it for this method). However, both the timings (caret is much slower) and the results are very different:

Method 1 (caret):

    Confusion Matrix and Statistics

              Reference
    Prediction    A    B    C    D    E
             A 1264  374  403  357  118
             B   25  324   28  146  124
             C  105  251  424  301  241
             D    0    0    0    0    0
             E    1    0    0    0  418

Method 2 (rpart):

    Confusion Matrix and Statistics

              Reference
    Prediction    A    B    C    D    E
             A 1288  176   14   79   25
             B   36  569   79   32   68
             C   31   88  690  121  113
             D   14   66   52  523   44
             E   26   50   20   49  651

As you can see, the second approach is the better classifier: the first method is very poor for classes D and E (class D is never predicted at all).
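To make that concrete, caret's confusionMatrix object already carries per-class statistics in its byClass slot. A minimal sketch, reusing the pred and testing objects from the first method above:

    # Per-class sensitivity from the caret model's confusion matrix;
    # class D is never predicted, so its sensitivity is 0.
    cm <- confusionMatrix(pred, testing$y)
    cm$byClass[, "Sensitivity"]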

I realize this may not be the right place to ask, but I would really appreciate a deeper understanding of this and related issues. caret seems like a great package for unifying modeling methods and call syntax, but now I'm hesitant to use it.

1 answer

caret actually does quite a bit under the hood. In particular, it uses resampling to optimize the model's hyperparameters. In your case, it tries three values of cp (print modFit and you will see the accuracy results for each value), while rpart simply uses cp = 0.01 unless you specify otherwise (see ?rpart.control). The resampling also takes longer, especially since caret uses bootstrapping by default.
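For example, the candidate cp values and their resampled accuracies can be read directly off the fitted train object. A minimal sketch, using modFit from the first method above:

    # Resampled accuracy for each cp value that caret tried
    modFit$results

    # The cp value caret selected as best
    modFit$bestTune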

To get similar results, you need to disable the resampling and specify cp explicitly:

    modFit <- caret::train(y ~ ., method = "rpart", data = training,
                           trControl = trainControl(method = "none"),
                           tuneGrid = data.frame(cp = 0.01))
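As a sanity check (a sketch, assuming the modFit above), the rpart fit that caret produced is stored in the finalModel slot, so it can be compared directly with the tree from rpart::rpart:

    # The underlying rpart object that caret fitted
    print(modFit$finalModel)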

In addition, you should set the same random seed before fitting each model.
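A minimal sketch (the seed value 123 is arbitrary):

    set.seed(123)
    modFit1 <- caret::train(y ~ ., method = "rpart", data = training,
                            trControl = trainControl(method = "none"),
                            tuneGrid = data.frame(cp = 0.01))

    set.seed(123)
    modFit2 <- rpart::rpart(y ~ ., data = training)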

That said, the extra functionality caret provides is a good thing, and you should probably just go with caret. If you want to learn more, the package is well documented, and its author has an excellent book, Applied Predictive Modeling.
