How to iterate over columns of a pandas dataframe to run a regression

I'm sure this is simple, but as a complete newbie to Python I'm having trouble figuring out how to iterate over the variables in a pandas dataframe and run a regression with each.

Here's what I'm doing:

    all_data = {}
    for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
        all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

    prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
    returns = prices.pct_change()

I know that I can run a regression as follows:

 regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit() 

but suppose I want to do this for each column in the dataframe. In particular, I want to regress FIUIX on FSTMX, then FSAIX on FSTMX, and then FSAVX on FSTMX. After each regression, I want to store the residuals.

I have tried various versions of the following, but I must not be getting the syntax right:

    resids = {}
    for k in returns.keys():
        reg = sm.OLS(returns[k], returns.FSTMX).fit()
        resids[k] = reg.resid

I think the problem is that I don't know how to refer to the returns column by key, so returns[k] is probably wrong.

Any advice on the best way to do this would be greatly appreciated. Perhaps there is a general pandas approach that I am missing.

+142
python pandas statsmodels
Jan 29 '15 at 15:42
9 answers
    for column in df:
        print(df[column])
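
Applied to the question, the same loop can drive one regression per column. The following is a minimal sketch, not the asker's actual data: the prices are synthetic (only the column names come from the question), and the no-intercept least-squares fit is computed in closed form so the example runs without statsmodels. With statsmodels installed, the residuals should correspond to sm.OLS(y, x).fit().resid. The helper name regress_on is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for the downloaded data (values are made up).
rng = np.random.default_rng(0)
tickers = ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(250, 4)), axis=0)),
    columns=tickers,
)

# pct_change() leaves NaN in the first row, so drop it before fitting.
returns = prices.pct_change().dropna()


def regress_on(y, x):
    """Residuals of a least-squares fit of y on x with no intercept."""
    beta = (x * y).sum() / (x * x).sum()
    return y - beta * x


# Iterating over a DataFrame yields its column labels.
resids = {}
for column in returns:
    if column == 'FSTMX':
        continue  # skip regressing FSTMX on itself
    resids[column] = regress_on(returns[column], returns['FSTMX'])

# One column of residuals per regressed fund.
resids = pd.DataFrame(resids)
```

This also suggests that returns[k] in the question is a valid way to pick a column by key; dropping the NaN row that pct_change() produces is the one extra step added here.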
+258
Sep 14 '15 at 6:42

You can use items() (called iteritems() in older pandas; iteritems() was removed in pandas 2.0):

    for name, values in df.items():
        print('{name}: {value}'.format(name=name, value=values.iloc[0]))
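
Because each iteration yields the label together with the column as a Series, the loop body can work on the column directly. A small sketch with a made-up DataFrame (the data and the name means are illustrative); it uses items(), the name under which this method is available in current pandas:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

means = {}
for name, values in df.items():
    # name is the column label, values is the column as a Series
    means[name] = float(values.mean())

print(means)  # {'A': 2.0, 'B': 5.0}
```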
+52
Apr 02 '16 at 11:31

This answer shows how to iterate over selected columns as well as over all columns of a DataFrame.

df.columns gives an Index containing the names of all columns in the DataFrame. That adds little if you want to iterate over all columns, since iterating over df itself already does that, but it is convenient if you want to iterate over only the columns of your choice.

We can easily use Python list slicing to slice df.columns according to our needs. For example, to iterate over all columns except the first, we can do:

    for column in df.columns[1:]:
        print(df[column])

Similarly, to iterate over all columns in reverse order, we can do:

    for column in df.columns[::-1]:
        print(df[column])

This technique lets us iterate over the columns in many other ways. Also remember that you can easily get the position of every column using:

    for ind, column in enumerate(df.columns):
        print(ind, column)
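
Slicing selects columns by position. To leave out a column by name instead, as the question needs for FSTMX, one option is Index.drop, which returns a new Index without the given label. A minimal sketch with made-up numbers (the name visited is illustrative):

```python
import pandas as pd

returns = pd.DataFrame({
    'FIUIX': [0.01, -0.02, 0.03],
    'FSAIX': [0.00, 0.01, -0.01],
    'FSTMX': [0.02, -0.01, 0.01],
})

visited = []
for column in returns.columns.drop('FSTMX'):
    # returns[column] is the Series for this column
    visited.append(column)

print(visited)  # ['FIUIX', 'FSAIX']
```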
+23
Jul 29 '17 at 17:07

You can index DataFrame columns by position using iloc. (This answer originally used ix, which was deprecated in pandas 0.20 and removed in 1.0.)

    df1.iloc[:, 1]

This returns the column at position 1, that is, the second column, since positions start at 0.

    df1.iloc[0]

This returns the first row.

This is the value at the intersection of row 0 and column 1:

    df1.iloc[0, 1]

and so on. That way you can enumerate() returns.keys() and use the number to index the DataFrame.
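
The enumerate() suggestion can be sketched as follows. This uses iloc, the positional indexer in current pandas, on a made-up DataFrame; the name by_position is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

by_position = {}
for i, key in enumerate(df1.keys()):
    # df1.iloc[:, i] is the column at position i, the same data as df1[key]
    by_position[key] = df1.iloc[:, i].tolist()

print(by_position)  # {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}
```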

+19
Jan 29 '15 at 15:51

A workaround is to transpose the DataFrame and iterate over its rows.

    for column_name, column in df.transpose().iterrows():
        print(column_name)
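
One caveat with this workaround: transposing a DataFrame whose columns have different dtypes forces everything to a common dtype, so the columns you get back may no longer be numeric. A small sketch on a made-up DataFrame comparing it with items() (the dictionary names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# dtype kind of each column as seen through the transpose workaround
kinds_via_transpose = {}
for column_name, column in df.transpose().iterrows():
    kinds_via_transpose[column_name] = column.dtype.kind

# dtype kind of each column as seen through items()
kinds_via_items = {}
for column_name, column in df.items():
    kinds_via_items[column_name] = column.dtype.kind

print(kinds_via_transpose['a'])  # O (object dtype)
print(kinds_via_items['a'])      # i (integer dtype)
```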
+5
Jul 22 '15 at 17:40

Using a list comprehension, you can get all column names (the header):

    [column for column in df]

+3
Mar 22 '17 at 22:38

I'm a little late, but here's how I did it. Steps:

  • Create a list of all columns
  • Use itertools to take combinations of x columns
  • Append each R-squared value to a results DataFrame along with the list of excluded columns
  • Sort the results DataFrame in descending order of R-squared to see which fit is best.

This is the code I used on a DataFrame called aft_tmt . Feel free to extrapolate to your use case.

    import itertools

    import pandas as pd
    import statsmodels.formula.api as smf

    # Print without truncating output.
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

    # Get the column names of the DataFrame and remove the columns
    # that should not be used as predictors.
    itercols = aft_tmt.columns.tolist()
    for col in ("sc97", "sc", "grc", "grc97"):
        itercols.remove(col)
    print(itercols)
    print(len(itercols))

    # Collect one row of results per combination of predictors.
    rows = []

    # Change 9 to the number of columns you want to combine from N columns.
    # Possibly run an outer loop from 0 to N/2?
    for x in itertools.combinations(itercols, 9):
        lmstr = "+".join(x)
        f = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt).fit()
        excluded = "+".join(y for y in itercols if y not in x)
        rows.append([f.rsquared, lmstr, excluded])

    # Results DataFrame, sorted by R-squared in descending order.
    regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])
    regression_res = regression_res.sort_values(by="Rsq", ascending=False)
+3
Apr 29 '17 at 3:37

Building on the accepted answer, if you also want the position of each column:

    for i, column in enumerate(df):
        print(i, df[column])

Here df[column] is a Series, which can easily be converted to a NumPy ndarray:

    import numpy as np

    for i, column in enumerate(df):
        print(i, np.asarray(df[column]))
+3
Apr 23 '18 at 17:36

To iterate over the contents of a DataFrame (and not its column names), you can use

    import numpy as np
    import pandas as pd

    # df has 3 columns and 5 rows
    df = pd.DataFrame(np.random.randint(0, 10, (5, 3)), columns=['A', 'B', 'C'])
    for row in df.values:
        print(row)

which outputs (the values are random, so yours will differ)

    [5 5 0]
    [7 4 5]
    [4 1 6]
    [2 3 4]
    [6 0 4]

To iterate over columns rather than rows, just transpose df.values :

    for col in df.values.T:
        print(col)

    [5 7 4 2 6]
    [5 4 1 3 0]
    [0 5 6 4 4]
0
Jul 07 '19 at 17:57