How to collapse a RandomForest into an equivalent decision tree?

As I understand it, when building a random forest, the algorithm combines a bunch of randomly generated decision trees, weighting them so that they fit the training data.

Can this averaging over the forest be simplified into a single decision tree? And if so, how can I access and display that tree?

What I want to do here is extract the information in such a tree to identify the leading attributes, their boundary values, and their position in the tree. I assume a tree like this would give a human (or a computer heuristic) an understanding of which attributes in the dataset are most informative in determining the target outcome.

This is probably a naive question, and if so, please bear with me; I'm new to this and want to get to the point where I understand it well enough.

1 answer

A RandomForest uses bootstrapping to create many training sets by sampling the data with replacement (bagging). Each bootstrap sample is very close to the original data but slightly different, since it may contain multiple copies of some points while other points from the original data are absent. (This produces a whole bunch of similar but different sets which, collectively, represent your data and lead to better generalization.)
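
To make "sampling with replacement" concrete, here is a minimal sketch using NumPy; the toy array and seed are purely illustrative, not from the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
X_idx = np.arange(10)  # toy "training set" of 10 row indices

# Sample with replacement: some rows appear several times,
# others not at all -- one bootstrap sample of the data.
bootstrap_idx = rng.integers(0, len(X_idx), size=len(X_idx))

print(X_idx[bootstrap_idx])                    # duplicates are expected
print(set(X_idx) - set(X_idx[bootstrap_idx]))  # rows this sample missed
```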

It then fits a DecisionTree to each set. However, whereas a regular DecisionTree at every split iterates over every feature, finds the best split point for each feature, and finally splits on the feature that produced the best split overall, a tree in a RandomForest considers only a random subsample of the features at each split (by default sqrt(n_features)).
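
In scikit-learn this per-split feature subsampling is controlled by the max_features argument. A small sketch, with an illustrative dataset and parameter values of my own choosing:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=100,     # number of bootstrapped trees
    max_features="sqrt",  # features considered at each split (the default)
    random_state=0,
)
rf.fit(X, y)
```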

So every tree in a RandomForest is fit to a bootstrap sample of the training data, and at each branching step it looks only at a subsample of the features, so some splits will be good but not necessarily optimal. This means each individual tree fits the original data less than perfectly. But when you average the output of all these (suboptimal) trees, you get a robust prediction. Where a single regular decision tree overfits the data, this two-way randomization (bagging and feature subsampling) lets the trees generalize, and the forest usually does a good job.
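
You can see this averaging directly: a fitted scikit-learn forest exposes its individual trees as rf.estimators_, and for RandomForestClassifier the predicted probabilities are the mean over the trees. A sketch, reusing rf and X from above:

```python
import numpy as np

# Each fitted tree predicts on its own; the forest's probability
# output is the per-class average over all trees.
per_tree = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
manual_avg = per_tree.mean(axis=0)

assert np.allclose(manual_avg, rf.predict_proba(X))
```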

Here's the catch: while you can average the output of each tree, you cannot "average the trees" to get an "average tree". Since trees are a bunch of chained if-then statements, there is no way to take those chains and come up with a single chain whose output equals the average output of all the chains. Every tree in the forest is different: even if the same features appear, they appear at different places in the trees, which makes merging them impossible. You cannot represent a RandomForest as a single tree.

There are two things you can do:

1) As RPresle mentioned, you can look at the .feature_importances_ attribute, which for each feature averages its split score over the different trees. The idea is that although you cannot get an "average tree", you can quantify how, and how effectively, each feature is used across the forest by averaging its score in each tree.
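
Continuing the sketch above, reading .feature_importances_ might look like this (feature names come from the toy iris dataset):

```python
from sklearn.datasets import load_iris

feature_names = load_iris().feature_names

# Importances are non-negative and sum to 1; higher means the feature
# was used in better splits, averaged over the whole forest.
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:25s} {imp:.3f}")
```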

2) When I fit a RandomForest model and need to get an idea of what is happening and how the features influence the result, I also fit a single DecisionTree. This model is usually not good on its own; it will be easily outperformed by the RandomForest, and I would not use it to predict anything, but by drawing the tree and examining its splits, in combination with the forest's .feature_importances_, I usually get a pretty good idea of the big picture.
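
A hedged sketch of that second suggestion, again reusing X, y, and feature_names from above; max_depth=3 is my choice just to keep the drawing readable:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text view of the splits (feature, threshold, position in the tree)...
print(export_text(dt, feature_names=feature_names))

# ...and a graphical view of the same tree.
plot_tree(dt, feature_names=feature_names, filled=True)
plt.show()
```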
