How to pre-process new instances for classification so that the feature encoding is the same as the model's with Scikit-learn?

Question

I am building multi-class classification models on data that has 6 features. I pre-process the data with the code below, using LabelEncoder.

    # Encodes the data for each column.
    def pre_process_data(self):
        self.encode_column('feedback_rating')
        self.encode_column('location')
        self.encode_column('condition_id')
        self.encode_column('auction_length')
        self.encode_column('model')
        self.encode_column('gb')

    # Gets the column using the column name, transforms the column data
    # and resets the column.
    def encode_column(self, name):
        le = preprocessing.LabelEncoder()
        current_column = np.array(self.X_df[name]).tolist()
        self.X_df[name] = le.fit_transform(current_column)

When I want to predict a new instance, I need to transform the new instance's data so that its features use the same encoding as the model. Is there an easy way to achieve this?

Also, if I want to save the model and load it later, is there an easy way to save the encoding so that I can use it to transform new instances for the loaded model?

python scikit-learn machine-learning

Rich gray Mar 20 '15 at 16:43

1 answer

AGS · Accepted Answer · 2015-03-21T20:22:45+0000

When I want to predict a new instance, I need to transform the new instance's data so that its features use the same encoding as the model. Is there an easy way to achieve this?

I'm not entirely sure how your classification "pipeline" works, but you can simply call the transform method of your fitted LabelEncoder on the new data. le will transform the new data as long as every label in it already exists in the training set.

    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()

    # training data (the mixed list is coerced to strings, so the
    # classes are sorted as '0', '1', '2', '6', 'false', 'true')
    train_x = [0, 1, 2, 6, 'true', 'false']
    le.fit_transform(train_x)
    # array([0, 1, 2, 3, 5, 4])

    # transform some new data
    new_x = [0, 0, 0, 2, 2, 2, 'false']
    le.transform(new_x)
    # array([0, 0, 0, 2, 2, 2, 4])

    # transform data containing a label that was not in the training set
    bad_x = [0, 2, 6, 'new_word']
    le.transform(bad_x)
    # raises ValueError: 'new_word' was not seen during fit
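Since the question encodes several columns, one way to apply the idea above is to keep one fitted encoder per column and reuse it for new instances. The following is only a minimal sketch: the column values are made up, the helper names (`fit_encoders`, `encode`, `unseen_labels`) are hypothetical, and a plain dict of lists stands in for the asker's DataFrame (a DataFrame can be indexed by column name in the same way).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values, for illustration only.
train = {
    "location": ["UK", "US", "UK", "DE"],
    "model": ["5s", "6", "6", "5c"],
}


def fit_encoders(columns):
    """Fit one LabelEncoder per column and return them keyed by column name."""
    return {name: LabelEncoder().fit(values) for name, values in columns.items()}


def encode(columns, encoders):
    """Encode every column with the encoder that was fitted for it."""
    return {name: encoders[name].transform(values) for name, values in columns.items()}


def unseen_labels(columns, encoders):
    """Return, per column, the values the fitted encoder has never seen."""
    return {
        name: sorted(set(values) - set(encoders[name].classes_))
        for name, values in columns.items()
    }


encoders = fit_encoders(train)
train_encoded = encode(train, encoders)

# A new instance is transformed with the already fitted encoders;
# calling fit again here would produce a different mapping.
new_instance = {"location": ["US"], "model": ["5c"]}
new_encoded = encode(new_instance, encoders)

# Checking for unknown values first avoids the ValueError from transform.
problems = unseen_labels({"location": ["FR", "UK"], "model": ["6"]}, encoders)
```

The important design point is that `fit` runs once, on the training data only; everything afterwards goes through `transform`.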

Also, if I want to save the model and load it later, is there an easy way to save the encoding so that I can use it to transform new instances for the loaded model?

You can save models / parts of your models as follows:

    import pickle

    # joblib is a standalone package; sklearn.externals.joblib has been removed
    import joblib
    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    train_x = [0, 1, 2, 6, 'true', 'false']
    le.fit_transform(train_x)

    # Save your encoding
    joblib.dump(le, '/path/to/save/model')
    # OR
    with open('/path/to/model', 'wb') as f:
        pickle.dump(le, f)

    # Load those encodings
    le = joblib.load('/path/to/save/model')
    # OR
    with open('/path/to/model', 'rb') as f:
        le = pickle.load(f)

    # Then use as normal
    new_x = [0, 0, 0, 2, 2, 2, 'false']
    le.transform(new_x)
    # array([0, 0, 0, 2, 2, 2, 4])
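With several encoded columns, the same approach can store all encoders in a single file. This is a sketch under assumptions: the column names and values are illustrative, the file name `encoders.joblib` is made up, and a temporary directory stands in for wherever the model is actually kept.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Hypothetical columns and values, for illustration only.
encoders = {
    "location": LabelEncoder().fit(["UK", "US", "DE"]),
    "gb": LabelEncoder().fit(["16", "32", "64"]),
}

# A temporary directory stands in for the real storage location.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "encoders.joblib")
    joblib.dump(encoders, path)    # save every encoder in one file
    restored = joblib.load(path)   # later, typically in another process

# The restored encoders map labels exactly as the originals did.
codes = restored["location"].transform(["US", "DE"])
```

Saving the encoders next to the classifier keeps the two in sync, so a loaded model always sees inputs encoded the way it was trained.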
