Let's say I want to compare different approaches to dimensionality reduction on a certain (controlled) data set consisting of n > 2 features, using cross-validation and the Pipeline class.
For example, if I want to compare PCA against LDA, I could do something like:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LinearDiscriminantAnalysis(n_components=2)),
    ('classification', GaussianNB())
])
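(For context, comparing such pipelines by cross-validation could look roughly like the sketch below; the toy `X` and `y` are stand-ins for the real data set, which is my own invention here.)

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# toy data standing in for the real (controlled) data set
rng = np.random.RandomState(0)
X = rng.randn(100, 6)
y = (X[:, 2] + X[:, 3] > 0).astype(int)

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
])

# 5-fold cross-validated accuracy for this pipeline
scores = cross_val_score(clf_pca, X, y, cv=5)
print(scores.mean())
```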
But now let's say that - based on some "domain knowledge" - I have a hypothesis that features 3 and 4 are "good features" (the third and fourth columns of the X_train array), and I want to compare them with the other approaches.
How can I enable such manual feature selection in the pipeline?
For example, this:

def select_3_and_4(X_train):
    return X_train[:, 2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),
    ('classification', GaussianNB())
])

obviously won't work, since Pipeline expects each intermediate step to be a transformer object, not a plain function.
So, I assume that I need to create a custom feature-selection class with a dummy fit method and a transform method that returns the two columns of the numpy array? Or is there a better way?
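For reference, a minimal sketch of what I have in mind (the class name ColumnExtractor is my own invention, not an sklearn API): a no-op fit plus a transform that slices out the chosen columns, so the object can sit in a Pipeline like any built-in step.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Select a fixed set of columns from a 2-D array."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # nothing to learn: the columns are chosen by hand
        return self

    def transform(self, X):
        return X[:, self.columns]

clf_manual = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', ColumnExtractor([2, 3])),  # features 3 and 4
    ('classification', GaussianNB())
])
```

Inheriting from BaseEstimator and TransformerMixin gives get_params/set_params and a free fit_transform, which keeps the class compatible with cross_val_score and grid search.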
python scikit-learn
user2489252