Ignore column when building model using SKLearn

Using R, you can ignore a variable (column) when building a model with the following syntax:

model = lm(dependent.variable ~ . - ignored.variable, data=my.training.set)

This is very convenient when your dataset contains indexes or an identifier.

How do you do this with scikit-learn in Python when your data is in a pandas DataFrame?
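One common approach (a sketch, using hypothetical column names `row_id`, `x1`, `x2`, and `y`) is to drop the unwanted column with `DataFrame.drop` before handing the frame to an estimator:

```python
import pandas as pd

# hypothetical training frame with an identifier column to ignore
df = pd.DataFrame({
    'row_id': [1, 2, 3, 4],          # identifier - should not be a feature
    'x1': [1.0, 2.0, 3.0, 4.0],
    'x2': [0.5, 1.5, 2.5, 3.5],
    'y':  [2.0, 4.1, 5.9, 8.2],      # dependent variable
})

# drop the identifier and the target to build the feature matrix
X = df.drop(columns=['row_id', 'y'])
y = df['y']

# X and y can now be passed to any scikit-learn estimator,
# e.g. LinearRegression().fit(X, y)
print(list(X.columns))  # ['x1', 'x2']
```

`drop` returns a new DataFrame, so the original frame keeps its identifier column for later joins or lookups.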


This is from my own code, which I used for last year's StackOverflow prediction competition:

    from __future__ import division
    from pandas import *
    from sklearn import cross_validation
    from sklearn import metrics
    from sklearn.ensemble import GradientBoostingClassifier

    basic_feature_names = ['BodyLength',
                           'NumTags',
                           'OwnerUndeletedAnswerCountAtPostTime',
                           'ReputationAtPostCreation',
                           'TitleLength',
                           'UserAge']

    fea = # extract the features - removed for brevity

    # construct our classifier
    clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)

    # now fit
    clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

    priv_fea = # this was my test dataset

    # now calculate the predicted classes
    pred = clf.predict(priv_fea[basic_feature_names])

So, if I wanted to train on a subset of the features for classification, I could do this:

    # want to train using fewer features so remove 'BodyLength'
    basic_feature_names.remove('BodyLength')
    clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

The idea is that a list can be used to select a subset of columns in a pandas DataFrame: build a new list, or remove a value from an existing one, and use it to index the frame.
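A minimal, self-contained illustration of that pattern (the tiny frame and column names here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

cols = list(df.columns)      # ['a', 'b', 'c']
cols.remove('b')             # drop one column from the list
subset = df[cols]            # list-based column selection
print(list(subset.columns))  # ['a', 'c']
```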

I'm not sure how easily this can be done with plain NumPy arrays, since their indexing works differently.
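For completeness, here is one way to do it with a plain NumPy array (a sketch with made-up data): columns are addressed by position rather than name, so you can remove one with `np.delete`.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# drop column at index 1; np.delete returns a new array
reduced = np.delete(arr, 1, axis=1)
print(reduced.tolist())  # [[1, 3], [4, 6]]
```

Because positions shift as columns are removed, keeping the data in a DataFrame with named columns until the final `.values` call is usually less error-prone.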
