Feature importance results differ between R and sklearn random forest regression

I am working on a regression problem and using both R's randomForest package and Python's sklearn random forest regression (RandomForestRegressor).

The R package can calculate the feature importance score in two different ways:

  • The first measure is computed from permuting the OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two is averaged over all trees and normalized by the standard deviation of the differences.

  • The second measure is the total decrease in node impurity from splitting on the variable, averaged over all trees. For classification, node impurity is measured by the Gini index. For regression, it is measured by the residual sum of squares (RSS).

sklearn, on the other hand, implements only the latter measure (see details here).
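For concreteness, here is a minimal Python sketch of both measures on synthetic data (the names X_demo, y_demo, rf are mine, purely for illustration). Measure #2 comes from sklearn's feature_importances_; measure #1 is only approximated here via sklearn.inspection.permutation_importance on a held-out split, since R permutes the per-tree OOB samples, so the values will not match R exactly:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data purely for illustration
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# measure #2: mean decrease in impurity (variance/MSE for regression),
# averaged over trees; sklearn normalizes the result to sum to 1
mdi = rf.feature_importances_

# measure #1 (approximation): permutation importance on a held-out split
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

print(mdi.round(3))
print(perm.importances_mean.round(3))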

I was interested in comparing method #2 in both implementations, so I did the following:

R

library(randomForest)
library(matrixStats)  # for rowSds

iteration_count <- 3
seeds <- seq(1, iteration_count, 1)
tree_count <- 500

rfmodels <- list()
for (i in 1:iteration_count) {
  set.seed(seeds[[i]])
  rfmodels[[i]] <- randomForest(y ~ ., X, ntree = tree_count, importance = TRUE, na.action = na.omit)
}

# Convert all iterations into matrix form
# (type = 1 is the permutation measure; type = 2 would give the node-impurity measure)
imp_score_matrix <- do.call(cbind, lapply(rfmodels, function(x) { importance(x, scale = TRUE, type = 1)[, 1] }))

# Calculate mean and s.d. of each feature's importance score across iterations
imp_score_stats <- cbind(rowMeans(imp_score_matrix), rowSds(imp_score_matrix))

# Order the matrix so that features are ranked by mean (most important features in the last rows)
ordered_imp_score_stats <- imp_score_stats[order(imp_score_stats[, 1]), ]

sklearn

# get FIS through mean decrease in impurity (default method for sklearn)
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

num_iter = 3  # number of times to generate FIS; will average over these scores
trees = 500
seeds = [l for l in range(num_iter)]
FIS = []

# mirror the R randomForest defaults for regression -
# https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
num_features = 1/3.0  # see mtry (p/3 for regression)
leaf = 5              # see nodesize (5 for regression)

FIS_map = {k: v for k, v in enumerate(X.columns.values)}  # {i: feature}
for i in range(num_iter):
    print("Iteration", i)
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[i],
                                max_features=num_features, min_samples_leaf=leaf)
    clf = clf.fit(X, y)
    FIS.append(clf.feature_importances_)

FIS_stats = pd.DataFrame(FIS).describe().T  # will have columns mean, std, etc.
FIS_stats = FIS_stats.sort_values("mean", ascending=False)  # most important features on top
FIS_stats['OTU'] = [FIS_map[i] for i in FIS_stats.index]  # add the OTU ID
FIS_stats = FIS_stats.set_index('OTU')
FIS_stats = FIS_stats[FIS_stats['mean'] > 0]  # remove OTU features with no mean importance

As you can see, I tried to set the parameters in sklearn to match the defaults used in R. The problem is that I get different results from the two implementations. Now, I understand that random forests are non-deterministic in several ways, so I do not expect the features to be ranked in exactly the same order; however, I see almost no overlap between the important features.
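To quantify the overlap rather than eyeballing it, one option is to compare the two rankings directly. A minimal sketch, assuming r_scores and sk_scores are pandas Series of the per-feature mean importances produced by the two snippets above, indexed by the same feature names (those Series names and the compare_rankings helper are my assumptions, not part of the snippets):

from scipy.stats import spearmanr

def compare_rankings(r_scores, sk_scores, k=20):
    # restrict to the features present in both rankings
    common = r_scores.index.intersection(sk_scores.index)
    # fraction of the top-k features that both implementations agree on
    r_top = set(r_scores.loc[common].nlargest(k).index)
    sk_top = set(sk_scores.loc[common].nlargest(k).index)
    overlap = len(r_top & sk_top) / float(k)
    # rank correlation over all common features
    rho, pval = spearmanr(r_scores.loc[common], sk_scores.loc[common])
    return overlap, rho, pval

# e.g.: overlap, rho, pval = compare_rankings(r_scores, sk_scores, k=20)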

(The same X data were used for both the R and sklearn runs.)

What am I doing wrong? Any thoughts?

Update

As pointed out, although I mentioned Gini for sklearn above, for regression its impurity measure is actually MSE.

So, comparing R's RSS-based importance scores against sklearn's MSE-based scores, I get the following:

[image: comparison of the R and sklearn feature importance rankings]
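One scale issue worth ruling out when reading such a comparison: sklearn normalizes feature_importances_ to sum to 1, while R's type=2 scores (IncNodePurity) are unnormalized totals of the impurity decrease. A minimal sketch of putting both on the same scale, where r_mean and sk_mean are assumed to be pandas Series of per-feature mean scores (R's from importance(..., type=2), sklearn's from feature_importances_):

import pandas as pd

# r_mean: mean IncNodePurity per feature from R (type = 2), as a pandas Series
# sk_mean: mean feature_importances_ per feature from sklearn, as a pandas Series
r_norm = r_mean / r_mean.sum()     # rescale R's raw impurity decreases to sum to 1
sk_norm = sk_mean / sk_mean.sum()  # sklearn's already sum to 1 per run; harmless safeguard

side_by_side = pd.concat([r_norm.rename("R"), sk_norm.rename("sklearn")], axis=1)
print(side_by_side.sort_values("R", ascending=False).head(10))

Note that this rescaling only changes the magnitudes, not the rankings, so it can explain differently scaled plots but not a disagreement over which features come out on top.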

Any ideas?
