How to get the names of features selected by recursive feature elimination in an sklearn pipeline?

Question

I use recursive feature elimination (RFE) in my sklearn pipeline. The pipeline looks something like this:

    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn import feature_selection
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    X = ['I am a sentence', 'an example']
    Y = [1, 2]
    X_dev = ['another sentence']

    # classifier
    LinearSVC1 = LinearSVC(tol=1e-4, C=0.10000000000000001)
    f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
            ('custom_features', CustomFeatures())])),
        ('rfe_feature_selection', f5),
        ('clf', LinearSVC1),
    ])

    pipeline.fit(X, Y)
    y_pred = pipeline.predict(X_dev)

How can I get the names of the features selected by RFE? The RFE should select the top 500 features, but I really need to see which features were selected.

EDIT:

I have a complex Pipeline, which consists of several nested Pipelines and FeatureUnions, percentile-based feature selection, and, at the end, recursive feature elimination:

    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
    fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
    f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

    countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000,
                                   analyzer=u'word', sublinear_tf=True, use_idf=True,
                                   min_df=2, max_df=0.85, lowercase=True)
    countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features=1000,
                                        analyzer=u'word', min_df=2, max_df=0.85,
                                        sublinear_tf=True, use_idf=True, lowercase=False)

    pipeline = Pipeline([
        ('union', FeatureUnion(
            transformer_list=[
                ('vectorized_pipeline', Pipeline([
                    ('union_vectorizer', FeatureUnion([
                        ('stem_text', Pipeline([
                            ('selector', ItemSelector(key='stem_text')),
                            ('stem_tfidf', countVecWord)
                        ])),
                        ('pos_text', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_tfidf', countVecWord_tags)
                        ])),
                    ])),
                    ('percentile_feature_selection', fs_vect)
                ])),
                ('custom_pipeline', Pipeline([
                    ('custom_features', FeatureUnion([
                        ('pos_cluster', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_cluster_inner', pos_cluster)
                        ])),
                        ('stylistic_features', Pipeline([
                            ('selector', ItemSelector(key='raw_text')),
                            ('stylistic_features_inner', stylistic_features)
                        ])),
                    ])),
                    ('percentile_feature_selection', fs),
                    ('inner_scale', inner_scaler)
                ])),
            ],
            # weight components in FeatureUnion
            # n_jobs=6,
            transformer_weights={
                'vectorized_pipeline': 0.8,  # 0.8,
                'custom_pipeline': 1.0  # 1.0
            },
        )),
        ('rfe_feature_selection', f5),
        ('clf', classifier),
    ])

I will try to explain the steps. The first Pipeline consists of vectorizers and is called "vectorized_pipeline"; all of them have a "get_feature_names" method. The second Pipeline consists of my own transformers, which I implemented with fit, transform and get_feature_names methods. When I use @Kevin's suggestion, I get an error saying that "union" (the name of the top-level step in my pipeline) does not have a get_feature_names method:

    support = pipeline.named_steps['rfe_feature_selection'].support_
    feature_names = pipeline.named_steps['union'].get_feature_names()
    print(np.array(feature_names)[support])

Also, when I try to get the feature names from one of the nested FeatureUnions, for example:

    support = pipeline.named_steps['rfe_feature_selection'].support_
    feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
    print(np.array(feature_names)[support])

I get a key error:

    feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
    KeyError: 'union_vectorizer'

python scikit-learn machine-learning

ivan_bilan Apr 14 '16 at 20:34

1 answer

Kevin · Accepted Answer · 2016-04-15T11:40:17+0000

You can access each Pipeline step through the named_steps attribute. Here is an example on the iris dataset that selects only 2 features, but the solution will scale.

    from sklearn import datasets
    from sklearn import feature_selection
    from sklearn.pipeline import Pipeline  # needed for Pipeline below
    from sklearn.svm import LinearSVC

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # classifier
    LinearSVC1 = LinearSVC(tol=1e-4, C=0.10000000000000001)
    f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)

    pipeline = Pipeline([
        ('rfe_feature_selection', f5),
        ('clf', LinearSVC1)
    ])

    pipeline.fit(X, y)

With named_steps you can access the attributes and methods of any transformer object in the pipeline. The RFE support_ attribute (or the get_support() method) returns a boolean mask of the selected features:

    support = pipeline.named_steps['rfe_feature_selection'].support_

Now support is a boolean array, so you can use it to efficiently retrieve the names of your selected features (columns). Make sure your feature names are in a numpy array and not in a Python list, because a list cannot be indexed with a boolean mask.

    import numpy as np

    feature_names = np.array(iris.feature_names)  # convert the list to an array
    feature_names[support]
    # array(['sepal width (cm)', 'petal width (cm)'], dtype='|S17')
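The array-versus-list point can be checked directly. The following is only a minimal sketch, assuming a reasonably recent scikit-learn and NumPy. It fits RFE on iris outside of a pipeline, confirms that support_ and get_support() agree, and uses get_support(indices=True) as one way to pick names from a plain Python list. The variable names (rfe, mask, idx) are my own and purely illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
rfe = RFE(estimator=LinearSVC(C=0.1, tol=1e-4), n_features_to_select=2, step=1)
rfe.fit(iris.data, iris.target)

# Boolean mask with one entry per input column; same values as get_support().
mask = rfe.support_
assert np.array_equal(mask, rfe.get_support())

# Boolean-mask indexing works on a numpy array ...
selected_from_array = np.array(iris.feature_names)[mask]

# ... while a plain list needs integer positions instead.
idx = rfe.get_support(indices=True)
selected_from_list = [iris.feature_names[i] for i in idx]

print(selected_from_array)
print(selected_from_list)
```

Both approaches should give the same two names. Which one you use is mostly a matter of taste, although the integer indices are handy if you also want to slice the original data by column.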

EDIT

Following up on my comment above, here is your example with the CustomFeatures() step removed:

    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn import feature_selection
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    import numpy as np

    X = ['I am a sentence', 'an example']
    Y = [1, 2]
    X_dev = ['another sentence']

    # classifier
    LinearSVC1 = LinearSVC(tol=1e-4, C=0.10000000000000001)
    f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000))])),
        ('rfe_feature_selection', f5),
        ('clf', LinearSVC1),
    ])

    pipeline.fit(X, Y)
    y_pred = pipeline.predict(X_dev)

    support = pipeline.named_steps['rfe_feature_selection'].support_
    feature_names = pipeline.named_steps['features'].get_feature_names()
    np.array(feature_names)[support]
