How to apply a fitted model to my test set in statsmodels (Python)

Question

I am working on a logistic regression model, and I am having trouble understanding how to apply the model fitted on my training set to my test set. Sorry, I'm new to Python and VERY new to statsmodels.

import pandas as pd
import statsmodels.api as sm
from sklearn import cross_validation

independent_vars = phy_train.columns[3:]

X_train, X_test, y_train, y_test = cross_validation.train_test_split(phy_train[independent_vars], phy_train['target'], test_size=0.3, random_state=0)

X_train = pd.DataFrame(X_train)
X_train.columns = independent_vars
X_test = pd.DataFrame(X_test)
X_test.columns = independent_vars
y_train = pd.DataFrame(y_train)
y_train.columns = ['target']
y_test = pd.DataFrame(y_test)
y_test.columns = ['target']

logit = sm.Logit(y_train,X_train[subset],missing='drop')
result = logit.fit()
print result.summary()

y_pred = logit.predict(X_test[subset])

From the last line, I get this error:

y_pred = logit.predict(X_test[subset])
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\statsmodels\discrete\discrete_model.py", line 378, in predict
    return self.cdf(np.dot(exog, params))
ValueError: matrices are not aligned

My training and test sets have the same number of variables, so I guess I don't understand what logit.predict() does.

python statsmodels

panterasBox Apr 13 '14 at 21:32

1 answer

user333700 · Accepted Answer · 2014-04-13T23:06:47+0000

There are two predict methods.

logit in your example is the model instance. The model instance does not know about the estimation results. The model's predict has a different signature, since it also needs the parameters: logit.predict(params, exog). It is mostly of interest for internal use.

What you want is the predict method of the results instance. In your example

y_pred = result.predict(X_test[subset])

should give the correct results. It uses the estimated parameters in the prediction with your new test data for the explanatory variables, X_test.

The call to model.fit() returns an instance of the results class, which provides access to additional statistics and post-estimation analysis, as well as prediction.