Exhaustive feature selection in scikit-learn?

Is there a built-in way to perform best subset selection in scikit-learn? That is, exhaustively evaluate all possible combinations of input features and then find the best-performing subset. I am familiar with the recursive feature elimination (RFE) class, but I am specifically interested in evaluating all possible combinations of input features, one after another.

+9
5 answers

No, best subset selection is not implemented. The easiest way is to write it yourself. This should get you started:

import numpy as np
from itertools import chain, combinations
from sklearn.cross_validation import cross_val_score

def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    # generate all non-empty subsets of feature indices
    subsets = chain.from_iterable(combinations(xrange(n_features), k + 1)
                                  for k in xrange(n_features))

    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score

This performs k-fold cross-validation inside the loop, so it will fit on the order of k · 2^p estimators when given data with p features.
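Here is a minimal usage sketch of the function above, adapted to current scikit-learn (`sklearn.model_selection` instead of the removed `sklearn.cross_validation`, and Python 3 `range`); the toy data and the choice of `LinearRegression` are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from itertools import chain, combinations
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    # all non-empty subsets of feature indices
    subsets = chain.from_iterable(combinations(range(n_features), k + 1)
                                  for k in range(n_features))
    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

# toy data: y depends only on columns 0 and 2
rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = 2 * X[:, 0] - 3 * X[:, 2] + 0.01 * rng.randn(60)

subset, score = best_subset_cv(LinearRegression(), X, y)
print(subset, round(score, 3))
```

The selected subset should contain the two informative columns, since any subset missing them leaves most of the variance unexplained.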

+6

Combining Fred Foo's answer with the comments by nopper, ihadanny and jimijazz, the following code gets the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) in the book "An Introduction to Statistical Learning with Applications in R".

import numpy as np
from itertools import combinations
from sklearn.cross_validation import cross_val_score

def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
       estimator must have fit and score functions.
       X must be a DataFrame.'''
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k:  # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores
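For reference, here is a self-contained usage sketch of this function, with the import updated to `sklearn.model_selection` for current scikit-learn; the DataFrame, column names, and `LinearRegression` estimator are invented for illustration:

```python
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

def best_subset(estimator, X, y, max_size=8, cv=5):
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))
    # best subset of each size, chosen by training score
    best_size_subset = []
    for subsets_k in subsets:
        best_score, best_sub = -np.inf, None
        for subset in subsets_k:
            estimator.fit(X.iloc[:, list(subset)], y)
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_sub = score, subset
        best_size_subset.append(best_sub)
    # compare the per-size winners with cross-validation
    best_score, best_sub, list_scores = -np.inf, None, []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_sub = score, subset
    return best_sub, best_score, best_size_subset, list_scores

# toy DataFrame: y depends only on columns 'b' and 'd'
rng = np.random.RandomState(1)
X = pd.DataFrame(rng.randn(80, 5), columns=list('abcde'))
y = 1.5 * X['b'] - 2.0 * X['d'] + 0.05 * rng.randn(80)

subset, score, per_size, scores = best_subset(LinearRegression(), X, y, max_size=5)
print([X.columns[i] for i in subset], round(score, 3))
```

Note the design: within each subset size the winner is picked by training score (valid, since models of equal size are comparable), while cross-validation is reserved for comparing across sizes.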

See the Notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection

+1

If you run this code in Python 3, note that xrange() has been renamed to range().

0

You might want to take a look at the MLxtend Exhaustive Feature Selector. It is not built into scikit-learn (yet?), but it works with scikit-learn classifier and regressor objects.

0
