I am working on a regression problem and am using both R's randomForest package and Python's sklearn RandomForestRegressor.
The R package can calculate the importance score of a feature in two different ways:
The first measure is computed by permuting the OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two is averaged over all trees and normalized by the standard deviation of the differences.
The second measure is the total decrease in node impurity from splitting on the variable, averaged over all trees. For classification, node impurity is measured by the Gini index. For regression, it is measured by the residual sum of squares (RSS).
sklearn, on the other hand, implements only the latter (see details here).
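For concreteness, here is a minimal Python sketch of the permutation idea behind measure #1. It is a simplification, not R's exact procedure: it permutes columns of a held-out validation set rather than each tree's true OOB sample, and it skips the normalization by the standard deviation of the differences. The function name and the X_val / y_val split are my own assumptions:

import numpy as np
from sklearn.metrics import mean_squared_error

def permutation_importance_sketch(model, X_val, y_val, n_repeats=5, seed=0):
    """Rough permutation importance: mean increase in validation MSE
    after shuffling one feature column at a time."""
    rng = np.random.RandomState(seed)
    X_val = np.asarray(X_val, dtype=float)
    baseline = mean_squared_error(y_val, model.predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            rng.shuffle(X_perm[:, j])  # break the link between feature j and y
            increases.append(mean_squared_error(y_val, model.predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)  # mean increase in MSE
    return importances

Dividing each mean increase by its standard deviation across repeats would be roughly analogous to R's scaled (scale=TRUE) scores.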
I was interested in comparing measure #2 across both implementations, so I did the following:
R
library(randomForest)
library(matrixStats)  # for rowSds

iteration_count <- 3
seeds <- seq(1, iteration_count, 1)
tree_count <- 500
rfmodels <- list()  # initialize before assigning by index
for (i in 1:iteration_count) {
  set.seed(seeds[[i]])
  rfmodels[[i]] <- randomForest(y ~ ., X, ntree = tree_count, importance = TRUE, na.action = na.omit)
}
# type = 1 pulls the (scaled) permutation scores from each fitted forest
imp_score_matrix <- do.call(cbind, lapply(rfmodels, function(x) importance(x, scale = TRUE, type = 1)[, 1]))
imp_score_stats <- cbind(rowMeans(imp_score_matrix), rowSds(imp_score_matrix))
ordered_imp_score_stats <- imp_score_stats[order(imp_score_stats[, 1]), ]
sklearn
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

num_iter = 3
trees = 500
seeds = list(range(num_iter))
num_features = 1 / 3.0  # R's default mtry for regression is p/3
leaf = 5                # R's default nodesize for regression is 5
FIS = []
FIS_map = {k: v for k, v in enumerate(X.columns.values)}  # position -> feature name
for i in range(num_iter):
    print("Iteration %d" % i)
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[i],
                                max_features=num_features, min_samples_leaf=leaf)
    clf = clf.fit(X, y)
    FIS.append(clf.feature_importances_)  # impurity-based scores (measure #2)

FIS_stats = pd.DataFrame(FIS).describe().T                 # mean, std, etc. per feature
FIS_stats = FIS_stats.sort_values("mean", ascending=False)
FIS_stats['OTU'] = pd.Series(FIS_map)                      # attach names; aligns on position index
FIS_stats = FIS_stats.set_index('OTU')
FIS_stats = FIS_stats[FIS_stats['mean'] > 0]               # drop features that were never used
As you can see, I tried to set the sklearn parameters to match the defaults used in R. The problem is that I get different results from the two implementations. Now, I understand that random forests have various sources of non-determinism, so I do not expect the features to be ranked in exactly the same order; however, I see almost no overlap between the sets of important features.
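To make "almost no overlap" concrete, this is roughly how I quantify the agreement between the two rankings. The CSV file name and the export step are hypothetical, and sk_imp reuses the clf from the last iteration above:

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical export of the R results, e.g.
#   write.csv(ordered_imp_score_stats, "r_importance.csv")
r_imp = pd.read_csv("r_importance.csv", index_col=0).iloc[:, 0]  # mean importance per feature
sk_imp = pd.Series(clf.feature_importances_, index=X.columns)

k = 20  # size of the "top features" sets to compare
top_r = set(r_imp.sort_values(ascending=False).head(k).index)
top_sk = set(sk_imp.sort_values(ascending=False).head(k).index)
print("top-%d overlap: %d features" % (k, len(top_r & top_sk)))

common = r_imp.index.intersection(sk_imp.index)  # align by feature name
rho, p = spearmanr(r_imp[common], sk_imp[common])
print("Spearman rank correlation: %.3f (p = %.3g)" % (rho, p))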
Furthermore, when I use only the top X features from each run, the features chosen by R perform much better on a hold-out set than those chosen by sklearn.

Am I doing something wrong? What might explain this?
Update
Following up on the comment about the Gini index: sklearn's source shows that for regression it measures node impurity with MSE, not Gini.

So, if R uses RSS and sklearn uses MSE, and the two are related by

MSE = RSS / n

shouldn't the importance rankings come out essentially the same?
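For reference, this is the back-of-the-envelope algebra I have in mind, using sklearn's convention that each node's impurity is weighted by its fraction of the N training samples (the notation n_m for the sample count in node m is mine):

\begin{align*}
\mathrm{RSS}_m &= n_m \,\mathrm{MSE}_m \\
\Delta_{\text{sklearn}}
  &= \frac{n_p}{N}\,\mathrm{MSE}_p - \frac{n_l}{N}\,\mathrm{MSE}_l - \frac{n_r}{N}\,\mathrm{MSE}_r \\
  &= \frac{1}{N}\left(\mathrm{RSS}_p - \mathrm{RSS}_l - \mathrm{RSS}_r\right)
   = \frac{\Delta\mathrm{RSS}}{N}
\end{align*}

If that is right, every split's impurity decrease differs between the two criteria only by the constant factor 1/N, which should leave the rankings unchanged, hence my confusion.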