I'm taking a Coursera machine learning course, and the coursework requires building predictive models on this dataset. I split the data into training and testing sets based on the outcome of interest (labelled y here, though it is actually the classe variable in the dataset):
inTrain <- createDataPartition(y = data$y, p = 0.75, list = FALSE)
training <- data[inTrain, ]
testing <- data[-inTrain, ]
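(For what it's worth, createDataPartition is stochastic, so to compare the two model fits on an identical split one can set a seed first; the seed value below is arbitrary:)

```r
library(caret)

set.seed(1234)  # arbitrary seed, only so the same split can be reproduced
inTrain  <- createDataPartition(y = data$y, p = 0.75, list = FALSE)
training <- data[inTrain, ]
testing  <- data[-inTrain, ]
```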
I tried two different methods:
modFit <- caret::train(y ~ ., method = "rpart", data = training)
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$y)
versus:
modFit <- rpart::rpart(y ~ ., data = training)
pred <- predict(modFit, newdata = testing, type = "class")
confusionMatrix(pred, testing$y)
I would expect these to give the same or very similar results, since the first method loads the rpart package (I assume it uses that package under the hood for this method). However, both the timings (caret is much slower) and the results are very different:
Method 1 (caret):

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1264  374  403  357  118
         B   25  324   28  146  124
         C  105  251  424  301  241
         D    0    0    0    0    0
         E    1    0    0    0  418
Method 2 (rpart):

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1288  176   14   79   25
         B   36  569   79   32   68
         C   31   88  690  121  113
         D   14   66   52  523   44
         E   26   50   20   49  651
As you can see, the second approach is the better classifier: the first method performs very poorly on classes D and E (it never predicts D at all).
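My working hypothesis (unverified) is that caret::train does not fit rpart with its defaults: it tunes the complexity parameter cp over a small grid using bootstrap resampling, so the final tree can be pruned much harder than rpart's default of cp = 0.01. A sketch that should test this by pinning cp to the rpart default and disabling resampling, using caret's standard tuneGrid and trControl arguments:

```r
library(caret)
library(rpart)

# Force caret to fit a single rpart tree with the rpart default cp = 0.01,
# skipping the cp tuning and bootstrap resampling that train() does by default.
modFit3 <- caret::train(y ~ ., data = training, method = "rpart",
                        tuneGrid = data.frame(cp = 0.01),
                        trControl = trainControl(method = "none"))
pred3 <- predict(modFit3, newdata = testing)
confusionMatrix(pred3, testing$y)
```

If this fit matches the plain rpart result, the discrepancy comes from caret's tuning choices rather than from the rpart engine itself.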
I realize this may not be the right place to ask, but I would really appreciate a deeper understanding of this and related issues. caret seems like a great package for unifying modelling methods and call syntax, but now I hesitate to use it.
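To make the question more concrete, here is how one can inspect what the caret fit actually selected (these are standard accessors on the object returned by train):

```r
modFit$bestTune            # the cp value caret settled on after resampling
modFit$results             # resampled accuracy for each candidate cp
modFit$finalModel$cptable  # rpart's complexity table for the final tree
```

If bestTune reports a cp much larger than 0.01, that would explain why the caret tree is so much smaller than the plain rpart one.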