I'm using the scikit-learn Python module to build a decision tree, and it works like a charm. There is one more thing I'd like to achieve: I want the tree to split on each attribute only once.

The reason is my rather strange dataset: it's noisy, and I'm specifically interested in the noise. My class labels are binary, say [+, -], and I have a bunch of numeric attributes, mostly in the range (0, 1).

When scikit-learn builds a tree, it splits on the same attribute several times to make the tree "better". I understand that this makes the leaf nodes purer, but it's not what I want to achieve here.
What I did instead was to determine a single cut point for each attribute by computing the information gain at different candidate cuts and choosing the maximum. Evaluated with leave-one-out and 1/3-2/3 cross-validation, this gives me better results than the original tree.
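For concreteness, here is a minimal sketch of what I do per attribute (plain NumPy; `entropy` and `best_cut` are just my own helper names, assuming `x` and `y` are 1-D NumPy arrays, and it's O(n²), not optimized):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    """Threshold on one attribute x that maximizes information gain."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    base, n = entropy(y), len(y)
    best_t, best_gain = None, -np.inf
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue  # only cut between distinct values
        # gain = parent entropy minus weighted average child entropy
        gain = base - (i * entropy(y[:i]) + (n - i) * entropy(y[i:])) / n
        if gain > best_gain:
            best_t, best_gain = (x[i] + x[i - 1]) / 2, gain
    return best_t, best_gain
```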
The problem is that when I try to automate this, I run into trouble near the lower and upper bounds (around 0 and 1): most of the elements end up on one side of the cut, and I get a very high information gain because one of the two subsets is pure, even though it contains only 1-2% of the total data.
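One workaround I've considered is to only admit cuts that keep at least some minimum fraction of the samples on each side, analogous to scikit-learn's `min_samples_leaf`. The sketch below reuses `entropy` from above; the 10% default is an arbitrary choice of mine:

```python
def best_cut_guarded(x, y, min_frac=0.1):
    """Same as best_cut, but reject cuts that leave fewer than
    min_frac of the samples on either side of the threshold."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    base, n = entropy(y), len(y)
    best_t, best_gain = None, -np.inf
    for i in range(1, n):
        # skip duplicate values and cuts too close to the boundaries
        if x[i] == x[i - 1] or i < min_frac * n or n - i < min_frac * n:
            continue
        gain = base - (i * entropy(y[:i]) + (n - i) * entropy(y[i:])) / n
        if gain > best_gain:
            best_t, best_gain = (x[i] + x[i - 1]) / 2, gain
    return best_t, best_gain
```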
In short, I'd like some way to make scikit-learn split on each attribute only once.
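The closest thing I've found in scikit-learn itself is to fit a depth-1 stump per attribute, which forces exactly one split per attribute and lets the library pick the threshold, with `min_samples_leaf` guarding against the boundary cuts described above. A rough sketch of the idea, assuming `X` is a 2-D NumPy array (and if I read the `tree_` internals right, a threshold of -2 marks a node that was not split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def single_cuts(X, y, min_leaf_frac=0.05):
    """Fit a depth-1 decision stump per column of X and collect the one
    threshold scikit-learn chooses for each attribute."""
    cuts = {}
    for j in range(X.shape[1]):
        stump = DecisionTreeClassifier(
            max_depth=1,
            criterion="entropy",
            # keep at least this many samples in each leaf to avoid
            # the near-boundary cuts (5% is an arbitrary choice)
            min_samples_leaf=max(1, int(min_leaf_frac * len(y))),
        )
        stump.fit(X[:, [j]], y)
        threshold = stump.tree_.threshold[0]
        if threshold != -2.0:  # -2 means no split was made at the root
            cuts[j] = threshold
    return cuts
```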
If that's not possible, do you have any tips on how to generate these cuts in a clean way?
Thanks a lot!