I work with a data set of approximately 1.5 million observations. I am finding that fitting a regression tree (I use the mob()* function from the party package) on more than a small subset of my data takes a very long time (I cannot work on a subset much larger than 50k observations).
I can think of two main problems that are slowing down the computation:
- The splits are evaluated at each step using the entire data set. I would be happy with results that chose the variable to split on at each node based on a random subsample of the data, as long as it continues to replenish the sample size at each child node in the tree.
- The operation is not parallelized. It seems to me that as soon as the tree has made its first split, it could use two processors, so that by the time there are 16 partitions each of the 16 processors on my machine would be in use. In practice only one of them seems to get used.
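One way to act on the first point without modifying mob() itself is to search for splits on a random subsample only. This is a hedged sketch, not code from the post; the data, variable names (x, z, y), and the subsample size are all made up for illustration:

```r
## Sketch: fit the mob() tree on a random subsample so the exhaustive
## split search runs over fewer rows. Simulated data for illustration.
library(party)

set.seed(1)
n <- 100000
d <- data.frame(x = runif(n), z = runif(n))
d$y <- ifelse(d$z > 0.5, 2 * d$x, -d$x) + rnorm(n)

idx <- sample(n, 5000)  # split search on 5k rows instead of all 100k
fit <- mob(y ~ x | z, data = d[idx, ], model = linearModel)
```

The terminal-node regressions could then be refit on the full data once the tree structure is fixed, which is close in spirit to the "replenish the sample at each child node" idea above.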
Does anyone have suggestions for alternative tree implementations that work better for large data sets, or for things I could change to speed up the computation**?
* I use mob() because I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
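For context, a minimal mob() call of the kind described in this footnote looks like the following; the data and variable names are invented, and g stands in for whatever partitioning variables are actually used:

```r
## Model-based partitioning: a linear regression y ~ x is fitted in each
## terminal node, and the tree may split on the partitioning variable g.
library(party)

set.seed(42)
d <- data.frame(x = runif(2000),
                g = gl(4, 500, labels = c("a", "b", "c", "d")))
d$y <- ifelse(d$g %in% c("a", "b"), 2 * d$x, -d$x) + rnorm(2000, sd = 0.5)

fit <- mob(y ~ x | g, data = d, model = linearModel)
coef(fit)  # per-node intercepts and slopes
```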
** One thing that seems to slow down the computation is that I have a factor variable with 16 levels. Calculating which subset of the levels to split on seems to take much longer than the other splits (since there are so many different ways to group them). This variable is one that we believe is important, so I am reluctant to drop it. Is there a recommended way to group the levels into fewer values before putting them into the tree model?
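One common trick (a heuristic here, not something from the post) is to order the 16 levels by their mean response and treat the factor as ordered, so the split search considers only 15 cutpoints instead of 2^15 - 1 subsets. All names below are placeholders:

```r
## Replace an unordered 16-level factor with an ordered one, sorted by
## the mean of the response within each level. Simulated data.
set.seed(1)
d <- data.frame(f = factor(sample(letters[1:16], 5000, replace = TRUE)),
                y = rnorm(5000))

lev_means <- tapply(d$y, d$f, mean)               # mean response per level
d$f_ord <- factor(d$f,
                  levels = names(sort(lev_means)),
                  ordered = TRUE)                  # 15 cutpoints, not 2^15-1
```

For plain least-squares regression trees this ordering is known to recover the optimal binary subset split; for mob()'s parameter-instability tests it is only an approximation, but it can cut the cost of that variable dramatically.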
parallel-processing r regression cart-analysis large-data
Rob Donnelly