Ignoring missing values in OLS multiple regression using statsmodels

I am trying to run an OLS multiple regression using statsmodels and a pandas DataFrame. Values are missing in different columns for different rows, and I get the error: ValueError: array should not contain infs or NaNs. I saw this SO question, which is similar but does not answer mine: statsmodel.api.Logit: valueerror array must not contain infs or nans

What I would like to do is run the regression and ignore every row that has a missing value in any of the variables used in that regression. Right now I have:

    import pandas as pd
    import numpy as np
    import statsmodels.formula.api as sm

    df = pd.read_csv('cl_030314.csv')
    results = sm.ols(formula="da ~ cfo + rm_proxy + cpi + year", data=df).fit()

I need something like missing = "drop". Any suggestions would be appreciated. Thank you very much.

2 answers

You answered your own question. Just pass

    missing='drop'

to ols:

    import statsmodels.formula.api as smf
    ...
    results = smf.ols(formula="da ~ cfo + rm_proxy + cpi + year", data=df, missing='drop').fit()

If this does not work, it is a bug; please report it on GitHub with a minimal working example (MWE).

FYI, pay attention to the import above. Not everything is available in the formula.api namespace, so you should keep it separate from statsmodels.api. Or just use

    import statsmodels.api as sm
    sm.formula.ols(...)
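To check that missing='drop' behaves as described in this answer, here is a minimal sketch. It assumes synthetic data in place of cl_030314.csv (the values are made up; only the column names come from the question) and compares the number of observations the model used with the number of rows that are complete in the model's columns.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for cl_030314.csv; the values are invented for illustration.
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "cfo": rng.normal(size=n),
    "rm_proxy": rng.normal(size=n),
    "cpi": rng.normal(size=n),
    "year": rng.integers(2000, 2010, size=n).astype(float),
})
df["da"] = 0.5 * df["cfo"] - 0.2 * df["rm_proxy"] + rng.normal(scale=0.1, size=n)

# Knock out values in different columns for different rows.
df.loc[0, "cfo"] = np.nan
df.loc[1, "cpi"] = np.nan
df.loc[2, "da"] = np.nan

results = smf.ols("da ~ cfo + rm_proxy + cpi + year", data=df, missing="drop").fit()

# Rows with no missing value in any column the formula uses.
complete_rows = len(df.dropna(subset=["da", "cfo", "rm_proxy", "cpi", "year"]))
print(int(results.nobs), complete_rows)
```

If the two printed numbers match, the rows with NaN in the model's variables were dropped before fitting.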

The answer from jseabold works very well, but it may not be enough if you want to perform calculations based on the predicted and true values, for example with the mean_squared_error function. In that case, it might be better to drop the NaN rows explicitly before fitting:

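    import pandas as pd
    import statsmodels.formula.api as sm  # same alias as in the question

    df = pd.read_csv('cl_030314.csv')
    df_cleaned = df.dropna()  # drops every row that has a NaN in any column
    results = sm.ols(formula="da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()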
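Note that df.dropna() with no arguments also removes rows whose only missing value is in a column the regression does not use. A possible refinement is to restrict the drop to the model's columns with the subset argument. The sketch below assumes synthetic data and a hypothetical extra column named unused, and computes the mean squared error directly with NumPy rather than with a library helper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; "unused" is a hypothetical column that is not in the model.
rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    "cfo": rng.normal(size=n),
    "rm_proxy": rng.normal(size=n),
    "cpi": rng.normal(size=n),
    "year": np.arange(n, dtype=float),
    "unused": rng.normal(size=n),
})
df["da"] = 1.0 + 0.5 * df["cfo"] + rng.normal(scale=0.1, size=n)
df.loc[0, "cfo"] = np.nan     # missing in a model column
df.loc[1, "unused"] = np.nan  # missing only in a column the model ignores

# Drop rows only when a column used by the regression is missing.
model_cols = ["da", "cfo", "rm_proxy", "cpi", "year"]
df_cleaned = df.dropna(subset=model_cols)

results = smf.ols("da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()

# fittedvalues shares df_cleaned's index, so true and predicted values line up.
mse = float(np.mean((df_cleaned["da"] - results.fittedvalues) ** 2))
print(len(df.dropna()), len(df_cleaned), mse)
```

Here the plain dropna() keeps one row fewer than the subset version, because it also discards the row that is missing only the unused column.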