How can I use a custom feature selection function in a scikit-learn `Pipeline`?

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset, which consists of n > 2 features, via cross-validation and using the Pipeline class.

For example, if I want to compare PCA against LDA, I could do something like:

    from sklearn.cross_validation import cross_val_score, KFold
    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import GaussianNB
    from sklearn.preprocessing import StandardScaler
    from sklearn.lda import LDA
    from sklearn.decomposition import PCA

    clf_all = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('classification', GaussianNB())
    ])

    clf_pca = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('reduce_dim', PCA(n_components=2)),
        ('classification', GaussianNB())
    ])

    clf_lda = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('reduce_dim', LDA(n_components=2)),
        ('classification', GaussianNB())
    ])

    # Constructing the k-fold cross-validation iterator (k=10)
    cv = KFold(n=X_train.shape[0],  # total number of samples
               n_folds=10,          # number of folds the dataset is divided into
               shuffle=True,
               random_state=123)

    scores = [
        cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
        for clf in [clf_all, clf_pca, clf_lda]
    ]

But now let's say that, based on some "domain knowledge", I have a hypothesis that features 3 and 4 (the third and fourth columns of the X_train array) might be "good features", and I want to compare them with the other approaches.

How can I enable such manual feature selection in the pipeline?

For example, the following

    def select_3_and_4(X_train):
        return X_train[:, 2:4]

    clf_all = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('feature_select', select_3_and_4),
        ('classification', GaussianNB())
    ])

obviously won't work.

So, I assume that I need to create a feature selection class with a dummy fit method and a transform method that returns those two columns of the numpy array? Or is there a better way?

python scikit-learn
4 answers

If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do it is:

    select_3_and_4.transform = select_3_and_4.__call__
    select_3_and_4.fit = lambda X, y=None: select_3_and_4  # fit must also accept y

and use select_3_and_4 as a step in your pipeline. You can also write a class.

Otherwise, you could also simply pass X_train[:, 2:4] to your pipeline if you know that the other features do not matter.
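As a side note, more recent scikit-learn versions (0.17+) ship sklearn.preprocessing.FunctionTransformer, which wraps a plain function like this into a pipeline-compatible transformer. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def select_3_and_4(X):
    # keep the third and fourth columns (indices 2 and 3)
    return X[:, 2:4]

# wrap the plain function into an object with fit/transform
selector = FunctionTransformer(select_3_and_4)

X = np.arange(20).reshape(4, 5)
X_sel = selector.fit_transform(X)  # shape (4, 2)
```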

Data-driven feature selection may be off topic here, but is always useful: check out, for example, sklearn.feature_selection.SelectKBest with sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression, using e.g. k=2 in your case.
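A sketch of that data-driven route, dropped into the same pipeline shape as the question (the toy data here is illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

clf_kbest = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', SelectKBest(f_classif, k=2)),  # keep the 2 highest-scoring features
    ('classification', GaussianNB())
])

# toy data: 100 samples, 4 features; the target depends on features 3 and 4
rng = np.random.RandomState(123)
X = rng.randn(100, 4)
y = (X[:, 2] + X[:, 3] > 0).astype(int)

clf_kbest.fit(X, y)
preds = clf_kbest.predict(X)
```

The difference from the manual approach is that the two columns are chosen by a univariate statistical test rather than by domain knowledge.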


I just want to post my solution for completeness; maybe it will be useful to others:

    class ColumnExtractor(object):

        def transform(self, X):
            cols = X[:, 2:4]  # columns at indices 2 and 3 (the 3rd and 4th) are "extracted"
            return cols

        def fit(self, X, y=None):
            return self

It can then be used in a Pipeline as follows:

    clf = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('reduce_dim', ColumnExtractor()),
        ('classification', GaussianNB())
    ])

EDIT: General Solution

And for a more general solution, if you want to select and stack multiple columns, you can use the following class:

    import numpy as np

    class ColumnExtractor(object):

        def __init__(self, cols):
            self.cols = cols

        def transform(self, X):
            col_list = []
            for c in self.cols:
                col_list.append(X[:, c:c+1])
            return np.concatenate(col_list, axis=1)

        def fit(self, X, y=None):
            return self

    clf = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('dim_red', ColumnExtractor(cols=(1, 3))),  # selects the 2nd and 4th columns
        ('classification', GaussianNB())
    ])

Adding to the answers of Sebastian Raschka and eickenberg: the requirements that a transformer object must satisfy are specified in the scikit-learn documentation.

There are a few more requirements than just having fit and transform if you want the estimator to be usable in parameter estimation, such as implementing set_params.
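For example, inheriting from sklearn.base.BaseEstimator gives you get_params / set_params for free, provided __init__ stores each constructor argument under an attribute of the same name. A sketch (the class name is illustrative):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):

    def __init__(self, cols=(0,)):
        self.cols = cols  # BaseEstimator introspects __init__, so store params unchanged

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, list(self.cols)]

sel = ColumnSelector(cols=(2, 3))
params = sel.get_params()    # provided by BaseEstimator
sel.set_params(cols=(0, 1))  # this is what GridSearchCV relies on

X = np.arange(12).reshape(3, 4)
out = ColumnSelector(cols=(2, 3)).fit_transform(X)  # fit_transform from TransformerMixin
```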


I did not find the accepted answer very clear, so here is my solution for others. Essentially, the idea is to create a new class based on BaseEstimator and TransformerMixin.

The following is a feature selector based on the percentage of NA values within each column; the perc value is the NA-percentage threshold.

    from sklearn.base import TransformerMixin, BaseEstimator

    class NonNAselector(BaseEstimator, TransformerMixin):
        """Select columns with less than `perc` fraction of NA values,
        for further imputation downstream. Class to use in a pipeline.
        -----
        fit : identifies the qualifying columns on the training set
        transform : keeps only those columns
        """

        def __init__(self, perc=0.1):
            self.perc = perc
            self.columns_with_less_than_x_na_id = None

        def fit(self, X, y=None):
            # fraction of NA per column; keep only columns below the threshold
            na_fraction = X.isna().sum() / X.shape[0]
            self.columns_with_less_than_x_na_id = X.columns[na_fraction < self.perc].tolist()
            return self

        def transform(self, X, y=None, **kwargs):
            return X[self.columns_with_less_than_x_na_id]

        def get_params(self, deep=False):
            return {"perc": self.perc}
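A quick standalone check of the column-filtering idea on a toy DataFrame (column names and threshold here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],           # 0% NA
    'b': [1.0, np.nan, 3.0, 4.0],        # 25% NA
    'c': [np.nan, np.nan, np.nan, 4.0],  # 75% NA
})

perc = 0.5
# fraction of NA per column; keep only columns strictly below the threshold
keep = df.columns[df.isna().mean() < perc].tolist()
filtered = df[keep]  # columns 'a' and 'b' survive
```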
