How do you calculate the correlation between all columns in a DataFrame and all columns in another DataFrame?

Question

I have a DataFrame object, stocks, filled with stock returns. I have another DataFrame object, industries, filled with industry returns. I want to find each stock's correlation with each industry.

An expensive way to do this is to merge the two DataFrame objects, calculate the correlation, and then throw out all the stock-to-stock and industry-to-industry correlations. Is there a more efficient way to do this?
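
For reference, the merge-based approach described above might look like the following minimal sketch. The small random `stocks` and `industries` frames and their column names are made up for illustration; real data would take their place.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 stocks and 2 industries over 250 periods.
rng = np.random.default_rng(0)
stocks = pd.DataFrame(rng.standard_normal((250, 3)), columns=['s1', 's2', 's3'])
industries = pd.DataFrame(rng.standard_normal((250, 2)), columns=['i1', 'i2'])

# Correlate everything with everything (a 5x5 matrix) ...
full = pd.concat([stocks, industries], axis=1).corr()

# ... then keep only the stock-vs-industry block and discard the rest.
cross = full.loc[stocks.columns, industries.columns]
print(cross.shape)
```

Most of the 5x5 matrix is computed only to be discarded, which is the cost the question is asking how to avoid.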

Thanks!

+7

python python-3.x pandas

Deets mcgeets Mar 08 '15 at 21:24

3 answers

Here is a one-liner that uses apply over the columns and avoids nested loops. The main advantage is that apply builds the result as a DataFrame.

 df1.apply(lambda s: df2.corrwith(s))
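
To make the shape of the result concrete, here is a self-contained sketch of the same one-liner. The frames `df1` and `df2` are small random stand-ins invented for illustration, not real stock or industry data.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(1000), 's2': np.random.randn(1000)})
df2 = pd.DataFrame({'i1': np.random.randn(1000), 'i2': np.random.randn(1000)})

# Each column of df1 is passed in as the Series s; corrwith correlates
# every column of df2 with it and returns a Series indexed by df2's columns.
result = df1.apply(lambda s: df2.corrwith(s))

# Rows are df2's columns, columns are df1's columns.
print(result)
```

A single entry can then be looked up by label, for example `result.loc['i1', 's1']` for the correlation between `s1` and `i1`.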

+10

Yt Mar 30 '16 at 7:27

Here is a slightly simpler answer than JohnE's, which uses pandas instead of numpy.corrcoef. As an added bonus, you do not need to extract the correlation value from the silly 2x2 correlation matrix, since the pandas Series correlation function simply returns a number, not a matrix.

 for s in ['s1', 's2']:
     for i in ['i1', 'i2']:
         print(df1[s].corr(df2[i]))
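
The difference between the two return types can be checked with a short sketch. The Series `a` and `b` are made-up random data used only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
a = pd.Series(np.random.randn(1000))
b = pd.Series(np.random.randn(1000))

matrix = np.corrcoef(a, b)   # 2x2 array: [[1, r], [r, 1]]
r_numpy = matrix[0, 1]       # the off-diagonal entry is the correlation
r_pandas = a.corr(b)         # pandas returns the scalar directly

print(matrix.shape)
print(np.isclose(r_numpy, r_pandas))
```

Both calls compute the Pearson correlation here, so the two values agree up to floating-point error; only the shape of what comes back differs.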

+6

failwhale Aug 30 '15 at 17:21

John · Accepted Answer · 2015-03-09T02:04:00+0000

(Edit to add: Instead of this answer, please see @yt's answer, which was added later but is clearly better.)

You can use numpy.corrcoef(), which is basically the same as corr in pandas, but the syntax may be more amenable to what you want.

 import numpy as np
 import pandas as pd

 np.random.seed(123)
 df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
 df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

 # Correlate every column of df1 with every column of df2.
 for s in ['s1', 's2']:
     for i in ['i1', 'i2']:
         print('corrcoef', s, i, np.corrcoef(df1[s], df2[i])[0, 1])

Which prints:

 corrcoef s1 i1 -0.00416977553597
 corrcoef s1 i2 -0.0096393047035
 corrcoef s2 i1 -0.026278689352
 corrcoef s2 i2 -0.00402030582064

Alternatively, you can load the results into a DataFrame with the appropriate labels:

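 # DataFrame.append was removed in pandas 2.0, so collect the
 # one-row frames in a list and concatenate them once at the end.
 frames = []
 for s in ['s1', 's2']:
     for i in ['i1', 'i2']:
         frames.append(pd.DataFrame({'corrcoef': np.corrcoef(df1[s], df2[i])[0, 1]},
                                    index=[s + '_' + i]))
 cc = pd.concat(frames)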

Which looks like this:

        corrcoef
 s1_i1 -0.004170
 s1_i2 -0.009639
 s2_i1 -0.026279
 s2_i2 -0.004020
