Exhaustive feature selection in scikit-learn?

Is there a built-in way to perform best subset selection in scikit-learn? That is, exhaustively evaluate all possible combinations of input features and then find the best-performing subset. I am familiar with the recursive feature elimination (RFE) class, but I am specifically interested in evaluating all possible combinations of input features, one after another.

+9
5 answers

No, best subset selection is not implemented. The easiest way is to write it yourself. This should get you started:

import numpy as np
from itertools import chain, combinations
from sklearn.cross_validation import cross_val_score

def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    # generate all non-empty subsets of feature indices
    subsets = chain.from_iterable(combinations(xrange(n_features), k + 1)
                                  for k in xrange(n_features))

    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score

This performs k-fold cross-validation inside the loop, so it will fit on the order of k · 2^p estimators when given data with p features.
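Here is a minimal usage sketch of the function above, adapted to current scikit-learn (`sklearn.model_selection` instead of the removed `sklearn.cross_validation`, and Python 3 `range`); the toy data and the choice of `LinearRegression` are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from itertools import chain, combinations
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    # all non-empty subsets of feature indices
    subsets = chain.from_iterable(combinations(range(n_features), k + 1)
                                  for k in range(n_features))
    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

# toy data: y depends only on columns 0 and 2
rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = 2 * X[:, 0] - 3 * X[:, 2] + 0.01 * rng.randn(60)

subset, score = best_subset_cv(LinearRegression(), X, y)
print(subset, round(score, 3))
```

The selected subset should contain the two informative columns, since any subset missing them leaves most of the variance unexplained.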

+6

Combining Fred Foo's answer with the comments by nopper, ihadanny and jimijazz, the following code gets the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) in the book "An Introduction to Statistical Learning with Applications in R".

import numpy as np
from itertools import combinations
from sklearn.cross_validation import cross_val_score

def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
       estimator must have fit and score functions.
       X must be a DataFrame.'''
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k:  # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores
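For reference, here is a self-contained usage sketch of this function, with the import updated to `sklearn.model_selection` for current scikit-learn; the DataFrame, column names, and `LinearRegression` estimator are invented for illustration:

```python
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

def best_subset(estimator, X, y, max_size=8, cv=5):
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))
    # best subset of each size, chosen by training score
    best_size_subset = []
    for subsets_k in subsets:
        best_score, best_sub = -np.inf, None
        for subset in subsets_k:
            estimator.fit(X.iloc[:, list(subset)], y)
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_sub = score, subset
        best_size_subset.append(best_sub)
    # compare the per-size winners with cross-validation
    best_score, best_sub, list_scores = -np.inf, None, []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_sub = score, subset
    return best_sub, best_score, best_size_subset, list_scores

# toy DataFrame: y depends only on columns 'b' and 'd'
rng = np.random.RandomState(1)
X = pd.DataFrame(rng.randn(80, 5), columns=list('abcde'))
y = 1.5 * X['b'] - 2.0 * X['d'] + 0.05 * rng.randn(80)

subset, score, per_size, scores = best_subset(LinearRegression(), X, y, max_size=5)
print([X.columns[i] for i in subset], round(score, 3))
```

Note the design: within each subset size the winner is picked by training score (valid, since models of equal size are comparable), while cross-validation is reserved for comparing across sizes.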

See the Notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection

+1

If you run this code in Python 3, note that xrange() has been renamed to range().

0

You might want to take a look at the MLxtend Exhaustive Feature Selector. It is not built into scikit-learn (yet?), but it works with scikit-learn classifier and regressor objects.

0
