Python 2.7 - statsmodels - formatting and writing summary output

Question

I am performing logistic regression using pandas 0.11.0 (data handling) and statsmodels 0.4.3 for the actual regression, on Mac OSX Lion.

I am going to run ~2900 different logistic regression models, and I need the results written to a csv file and formatted in a particular way.

Currently, I only know about print result.summary(), which prints the results (as shown below) to the shell:

                               Logit Regression Results
    ==============================================================================
    Dep. Variable:            death_death   No. Observations:                 9752
    Model:                          Logit   Df Residuals:                     9747
    Method:                           MLE   Df Model:                            4
    Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
    Time:                        22:15:05   Log-Likelihood:                -5806.9
    converged:                       True   LL-Null:                       -5655.8
                                            LLR p-value:                     1.000
    ===============================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
    -------------------------------------------------------------------------------
    age_age5064    -0.1999      0.055     -3.619      0.000        -0.308    -0.092
    age_age6574    -0.2553      0.053     -4.847      0.000        -0.359    -0.152
    sex_female     -0.2515      0.044     -5.765      0.000        -0.337    -0.166
    stage_early    -0.1838      0.041     -4.528      0.000        -0.263    -0.104
    access         -0.0102      0.001    -16.381      0.000        -0.011    -0.009
    ===============================================================================

I will also need the odds ratio, which is computed by print np.exp(result.params) and is printed in the shell like this:

    age_age5064    0.818842
    age_age6574    0.774648
    sex_female     0.777667
    stage_early    0.832098
    access         0.989859
    dtype: float64

I need all of these written to the csv file as one very long row (I'm not sure at the moment whether I need things like Log-Likelihood, but I have included them for the sake of thoroughness):

 `Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`

I think you get the picture: one very long row holding all of these actual values, plus a header with the column names in a similar format.

I am familiar with the csv module in Python and with pandas. I'm not sure whether this information could be formatted and stored in a pandas DataFrame and then written to a file with to_csv once all ~2900 logistic regression models have completed; that would certainly be fine. Writing each row out at the end of each model is fine too (using the csv module).

UPDATE:

So, I looked more at the statsmodels website, specifically trying to figure out how the results of a model are stored within classes. There seems to be a class called Results that will need to be used. I think that inheriting from this class to create another class, in which some of the methods/operators are changed, might be the way to get the formatting I need. I have very little experience with how to do this, and I will need to spend quite a bit of time figuring it out (which is fine). If anybody can help or has more experience, that would be awesome!

Here is the site where the classes are laid out: statsmodels result class

+8

python python-2.7 pandas statsmodels

DMML May 23 '13 at 4:19

3 answers

If you want the coefficients, results.params will give them to you (for a Logit model these are log-odds, so np.exp(results.params) gives the odds ratios). If you want the p-values, use results.pvalues. In any case, you can use dir(results) to list all the attributes of the object.
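As a small illustration of the relationship between coefficients and odds ratios, the sketch below uses a pandas Series as a stand-in for result.params, filled with the rounded coefficients printed in the question (so the odds ratios differ slightly from the question's output in the last digits). With a real fitted model you would use result.params directly.

```python
import numpy as np
import pandas as pd

# Stand-in for result.params: the rounded coefficients shown in the question.
params = pd.Series(
    [-0.1999, -0.2553, -0.2515, -0.1838, -0.0102],
    index=["age_age5064", "age_age6574", "sex_female", "stage_early", "access"],
)

# Logit coefficients are log-odds, so exponentiating gives the odds ratios.
odds_ratios = np.exp(params)
print(odds_ratios.round(4))
```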

+2

Atendra Jul 07 '14 at 14:09

I found this formulation a little more straightforward. You can add/remove columns by following the syntax of the examples (pvals, coeff, conf_lower, conf_higher).

    import pandas as pd  # This can be left out if already present...

    def results_summary_to_dataframe(results):
        '''This takes the result of a statsmodels results table and transforms it into a dataframe'''
        pvals = results.pvalues
        coeff = results.params
        conf_lower = results.conf_int()[0]
        conf_higher = results.conf_int()[1]

        results_df = pd.DataFrame({"pvals": pvals,
                                   "coeff": coeff,
                                   "conf_lower": conf_lower,
                                   "conf_higher": conf_higher
                                   })

        # Reordering...
        results_df = results_df[["coeff", "pvals", "conf_lower", "conf_higher"]]
        return results_df
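To get from a per-model table like this to the single long row the question asks for, one option is to flatten it into variable_statistic keys. The sketch below is only an illustration: flatten_to_row is a hypothetical helper name, and results_df is built by hand from two rows of the question's summary rather than from a fitted model.

```python
import pandas as pd

def flatten_to_row(results_df):
    """Turn a (variable x statistic) table into one flat dict.

    Keys look like 'age_age5064_coeff'. Hypothetical helper, not part of
    statsmodels or pandas.
    """
    row = {}
    for var, stats in results_df.iterrows():
        for stat, value in stats.items():
            row["%s_%s" % (var, stat)] = value
    return row

# Hand-built table using values from the summary printed in the question.
results_df = pd.DataFrame(
    {
        "coeff": [-0.1999, -0.2553],
        "pvals": [0.000, 0.000],
        "conf_lower": [-0.308, -0.359],
        "conf_higher": [-0.092, -0.152],
    },
    index=["age_age5064", "age_age6574"],
)

row = flatten_to_row(results_df)
print(sorted(row))
```

Collecting one such dict per model in a list and passing the list to pd.DataFrame gives one row per model.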

0

Afflatus Apr 21 '16 at 1:06

user333700 · Accepted Answer · 2013-05-25T23:58:55+0000

Currently there is no ready-made table of the parameters and their result statistics.

Essentially, you need to collect all the results yourself, whether in a list, a numpy array or a pandas DataFrame, depending on what is more convenient for you.

For example, if I want one numpy array holding, for each model, llf plus the results shown in the summary table, then I could use

    import numpy

    res_all = []
    for res in results:  # results: a list of fitted result instances
        low, upp = numpy.asarray(res.conf_int()).T  # unpack columns
        res_all.append(numpy.concatenate(([res.llf], res.params, res.tvalues,
                                          res.pvalues, low, upp)))
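If every model has the same regressors, a list like res_all can be stacked into a 2-D array and saved with a header row. The sketch below assumes exactly that; the single row is typed in by hand from the question's summary (two of the five variables) instead of coming from fitted results, and the column labels are made up for illustration. Note that the header has to follow the order of the concatenation: grouped by statistic, not by variable.

```python
import io

import numpy as np

names = ["age_age5064", "access"]  # two variables from the question

# One row laid out like an entry of res_all:
# llf, params, tvalues, pvalues, lower bounds, upper bounds.
res_all = [
    np.concatenate((
        [-5806.9],
        [-0.1999, -0.0102],
        [-3.619, -16.381],
        [0.0, 0.0],
        [-0.308, -0.011],
        [-0.092, -0.009],
    ))
]

# Illustrative labels; statistic is the outer loop, variable the inner one.
header = ["llf"] + [
    "%s_%s" % (name, stat)
    for stat in ("coef", "z", "p", "ci_low", "ci_upp")
    for name in names
]

table = np.vstack(res_all)  # shape: (number of models, number of columns)

buffer = io.StringIO()      # replace with a file name to write to disk
np.savetxt(buffer, table, delimiter=",", header=",".join(header), comments="")
print(buffer.getvalue())
```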

But it might be better to align this with pandas, depending on what structure you have across models.

You could write a helper function that takes all the results from a results instance and concatenates them into a single row.

(I'm not sure what is most convenient for writing to csv line by line)
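A rough sketch of such a helper follows, written for a current Python 3 and pandas. It assumes the result object exposes llf, params, bse, tvalues and pvalues; result_to_row is a hypothetical name, and a SimpleNamespace filled with numbers from the question's summary stands in for a fitted model so the example runs without statsmodels.

```python
import io
from types import SimpleNamespace

import numpy as np
import pandas as pd


def result_to_row(res):
    """Flatten one result into ordered (column, value) pairs (hypothetical helper)."""
    names = list(pd.Series(res.params).index)
    coef = np.asarray(res.params, dtype=float)
    stats = [
        ("coef", coef),
        ("std_err", np.asarray(res.bse, dtype=float)),
        ("z", np.asarray(res.tvalues, dtype=float)),
        ("p>|z|", np.asarray(res.pvalues, dtype=float)),
    ]
    row = [("Log-Likelihood", res.llf)]
    for i, name in enumerate(names):
        for label, values in stats:
            row.append(("%s_%s" % (name, label), values[i]))
    # Odds ratios go at the end of the row, as in the question's layout.
    odds = np.exp(coef)
    for i, name in enumerate(names):
        row.append(("%s_odds_ratio" % name, odds[i]))
    return row


# Stand-in for a fitted model, using two variables from the question.
names = ["age_age5064", "access"]
fake = SimpleNamespace(
    llf=-5806.9,
    params=pd.Series([-0.1999, -0.0102], index=names),
    bse=pd.Series([0.055, 0.001], index=names),
    tvalues=pd.Series([-3.619, -16.381], index=names),
    pvalues=pd.Series([0.000, 0.000], index=names),
)

rows = [result_to_row(fake)]  # in practice: one entry per fitted model
columns = [col for col, _ in rows[0]]
table = pd.DataFrame([dict(r) for r in rows], columns=columns)

buffer = io.StringIO()        # replace with a file path to write to disk
table.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Building the DataFrame once after all models have run keeps the header and rows consistent. Writing each row with the csv module as each model finishes would also work, and it keeps partial output if a later model fails.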

edit:

Here is an example of storing regression results in a data frame

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21

the loop is on line 159.

summary() and similar code outside of statsmodels, for example http://johnbeieler.org/py_apsrtable/ for combining multiple results, is oriented towards printing and not towards storing variables.
