I work with a data set of approximately 1.5 million observations. I am finding that fitting a regression tree (I use the mob()* function from the party package) on more than a small subset of my data takes a very long time (I cannot work on a subset much larger than 50k observations).
I can think of two main problems that are slowing down the computation:
- The splits are evaluated at each step using the entire data set. I would be happy with results that chose the variable to split on at each node based on a random subsample of the data, as long as it continues to replenish the sample size at each child node in the tree.
- The operation is not parallelized. It seems to me that as soon as the tree has made its first split, it could use two processors, so that by the time there are 16 partitions each of the 16 processors on my machine would be in use. In practice only one of them seems to get used.
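One way to act on the first point without modifying mob() itself is to search for splits on a random subsample only. This is a hedged sketch, not code from the post; the data, variable names (x, z, y), and the subsample size are all made up for illustration:

```r
## Sketch: fit the mob() tree on a random subsample so the exhaustive
## split search runs over fewer rows. Simulated data for illustration.
library(party)

set.seed(1)
n <- 100000
d <- data.frame(x = runif(n), z = runif(n))
d$y <- ifelse(d$z > 0.5, 2 * d$x, -d$x) + rnorm(n)

idx <- sample(n, 5000)  # split search on 5k rows instead of all 100k
fit <- mob(y ~ x | z, data = d[idx, ], model = linearModel)
```

The terminal-node regressions could then be refit on the full data once the tree structure is fixed, which is close in spirit to the "replenish the sample at each child node" idea above.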
Does anyone have suggestions for alternative tree implementations that work better for large data sets, or for things I could change to speed up the computation**?
* I use mob() because I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
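For context, a minimal mob() call of the kind described in this footnote looks like the following; the data and variable names are invented, and g stands in for whatever partitioning variables are actually used:

```r
## Model-based partitioning: a linear regression y ~ x is fitted in each
## terminal node, and the tree may split on the partitioning variable g.
library(party)

set.seed(42)
d <- data.frame(x = runif(2000),
                g = gl(4, 500, labels = c("a", "b", "c", "d")))
d$y <- ifelse(d$g %in% c("a", "b"), 2 * d$x, -d$x) + rnorm(2000, sd = 0.5)

fit <- mob(y ~ x | g, data = d, model = linearModel)
coef(fit)  # per-node intercepts and slopes
```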
** One thing that seems to slow down the computation is that I have a factor variable with 16 levels. Calculating which subset of the levels to split on seems to take much longer than the other splits (since there are so many different ways to group them). This variable is one that we believe is important, so I am reluctant to drop it. Is there a recommended way to group the levels into fewer values before putting them into the tree model?
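One common trick (a heuristic here, not something from the post) is to order the 16 levels by their mean response and treat the factor as ordered, so the split search considers only 15 cutpoints instead of 2^15 - 1 subsets. All names below are placeholders:

```r
## Replace an unordered 16-level factor with an ordered one, sorted by
## the mean of the response within each level. Simulated data.
set.seed(1)
d <- data.frame(f = factor(sample(letters[1:16], 5000, replace = TRUE)),
                y = rnorm(5000))

lev_means <- tapply(d$y, d$f, mean)               # mean response per level
d$f_ord <- factor(d$f,
                  levels = names(sort(lev_means)),
                  ordered = TRUE)                  # 15 cutpoints, not 2^15-1
```

For plain least-squares regression trees this ordering is known to recover the optimal binary subset split; for mob()'s parameter-instability tests it is only an approximation, but it can cut the cost of that variable dramatically.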
parallel-processing r regression cart-analysis large-data
Rob Donnelly