Confused about random_state in scikit-learn decision tree

Question

I am confused about the random_state parameter and not sure why training a decision tree requires any randomness. My thoughts: (1) is it related to random forests? (2) is it related to splitting the data set into training and test sets? If so, why not use the train_test_split method directly ( http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html )?

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.cross_validation import cross_val_score
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> clf = DecisionTreeClassifier(random_state=0)
    >>> iris = load_iris()
    >>> cross_val_score(clf, iris.data, iris.target, cv=10)
    ...
    ...
    array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
            0.93...,  0.93...,  1.     ,  0.93...,  1.      ])

Regards, Lin

python python-2.7 scikit-learn machine-learning decision-tree

Lin ma Aug 26 '16 at 3:48

1 answer

Ami tavory · Accepted Answer · 2016-08-26T05:26:06+0000

This is explained in the documentation.

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

So, basically, a sub-optimal greedy algorithm is repeated a number of times using random selections of features and samples (a technique similar to the one used in random forests). The random_state parameter lets you control these random choices.
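To make the role of randomness concrete, here is a small sketch. It relies on a behaviour noted in the scikit-learn documentation: features are randomly permuted at each split, so when several splits are equally good, the one that is picked can depend on the seed. The toy data (two identical columns) and the helper name root_feature are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: both columns are identical, so a split on
# feature 0 and a split on feature 1 are exactly equally good.
column = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([column, column])
y = np.array([0, 0, 0, 1, 1, 1])


def root_feature(seed):
    """Fit a tree with the given seed and return the feature used at the root."""
    clf = DecisionTreeClassifier(random_state=seed).fit(X, y)
    return int(clf.tree_.feature[0])  # node 0 is the root


# Which feature wins the tie at the root, for a range of seeds.
roots = {seed: root_feature(seed) for seed in range(20)}
print(sorted(set(roots.values())))
```

I would expect both feature indices to show up across the seeds, while any single seed always picks the same one. On real data, exact ties are rarer, so different seeds often give the same tree.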

The documentation says:

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
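The three cases can be tried out with sklearn.utils.check_random_state, a public helper that turns such an argument into a generator. That the tree estimator resolves random_state in the same way internally is my assumption here. The variable names are only for illustration.

```python
import numpy as np
from sklearn.utils import check_random_state

# Case 1: an int is used as a seed, so two generators built from the
# same int produce the same sequence of draws.
a = check_random_state(0)
b = check_random_state(0)
same_draws = a.randint(0, 1000, size=5).tolist() == b.randint(0, 1000, size=5).tolist()

# Case 2: a RandomState instance is used as the generator itself.
rng = np.random.RandomState(42)
passthrough = check_random_state(rng) is rng

# Case 3: None falls back to the global generator behind np.random.
global_rng = check_random_state(None)
is_random_state = isinstance(global_rng, np.random.RandomState)

print(same_draws, passthrough, is_random_state)
```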

So the random algorithm will be used in any case. Passing any value (whether a specific int, e.g. 0, or a RandomState instance) will not change that. The only rationale for passing an int value (0 or otherwise) is to make the outcome consistent across calls: if you call it with random_state=0 (or any other value), you will get the same result each and every time.
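A quick way to check the reproducibility claim is to fit the same classifier twice with the same seed and compare the learned trees. This is a minimal sketch on the iris data from the question; fit_tree is a hypothetical helper, and comparing the tree_.feature and tree_.threshold arrays is just one way to test that two trees are identical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()


def fit_tree(seed):
    """Fit a decision tree on iris with a fixed random_state."""
    clf = DecisionTreeClassifier(random_state=seed)
    return clf.fit(iris.data, iris.target)


first = fit_tree(0)
second = fit_tree(0)

# Same seed, same data: the split features and thresholds of every
# node should match exactly.
same_features = np.array_equal(first.tree_.feature, second.tree_.feature)
same_thresholds = np.array_equal(first.tree_.threshold, second.tree_.threshold)
print(same_features, same_thresholds)
```

With random_state=None instead, the two fits draw from the global NumPy generator, so they are not guaranteed to match.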
