Ignore column when building model using SKLearn

Using R, you can ignore a variable (column) when building a model with the following syntax:

model = lm(dependent.variable ~ . - ignored.variable, data=my.training.set)

This is very convenient when your dataset contains indexes or an identifier.

How do you do this with scikit-learn in Python when your data is in a pandas DataFrame?
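One common approach (a sketch, using hypothetical column names `row_id`, `x1`, `x2`, and `y`) is to drop the unwanted column with `DataFrame.drop` before handing the frame to an estimator:

```python
import pandas as pd

# hypothetical training frame with an identifier column to ignore
df = pd.DataFrame({
    'row_id': [1, 2, 3, 4],          # identifier - should not be a feature
    'x1': [1.0, 2.0, 3.0, 4.0],
    'x2': [0.5, 1.5, 2.5, 3.5],
    'y':  [2.0, 4.1, 5.9, 8.2],      # dependent variable
})

# drop the identifier and the target to build the feature matrix
X = df.drop(columns=['row_id', 'y'])
y = df['y']

# X and y can now be passed to any scikit-learn estimator,
# e.g. LinearRegression().fit(X, y)
print(list(X.columns))  # ['x1', 'x2']
```

`drop` returns a new DataFrame, so the original frame keeps its identifier column for later joins or lookups.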


This is from my own code, which I used for last year's StackOverflow prediction competition:

    from __future__ import division
    from pandas import *
    from sklearn import cross_validation
    from sklearn import metrics
    from sklearn.ensemble import GradientBoostingClassifier

    basic_feature_names = ['BodyLength',
                           'NumTags',
                           'OwnerUndeletedAnswerCountAtPostTime',
                           'ReputationAtPostCreation',
                           'TitleLength',
                           'UserAge']

    fea = # extract the features - removed for brevity

    # construct our classifier
    clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)

    # now fit
    clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

    priv_fea = # this was my test dataset

    # now calculate the predicted classes
    pred = clf.predict(priv_fea[basic_feature_names])

So, if I wanted to train on a subset of the features for classification, I could do this:

    # want to train using fewer features so remove 'BodyLength'
    basic_feature_names.remove('BodyLength')
    clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

The idea is that a list can be used to select a subset of columns in a pandas DataFrame: build a new list, or remove a value from an existing one, and use it to index the frame.
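A minimal, self-contained illustration of that pattern (the tiny frame and column names here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

cols = list(df.columns)      # ['a', 'b', 'c']
cols.remove('b')             # drop one column from the list
subset = df[cols]            # list-based column selection
print(list(subset.columns))  # ['a', 'c']
```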

I'm not sure how easily this can be done with plain NumPy arrays, since their indexing works differently.
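For completeness, here is one way to do it with a plain NumPy array (a sketch with made-up data): columns are addressed by position rather than name, so you can remove one with `np.delete`.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# drop column at index 1; np.delete returns a new array
reduced = np.delete(arr, 1, axis=1)
print(reduced.tolist())  # [[1, 3], [4, 6]]
```

Because positions shift as columns are removed, keeping the data in a DataFrame with named columns until the final `.values` call is usually less error-prone.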
