Calculation of pair correlation between all columns

Question

I work with a large biological dataset.

I want to calculate the PCC (Pearson correlation coefficient) for every pair of columns in my data table and save the result as a DataFrame or CSV file.

The data table looks like this: the columns are gene names and the rows are dataset codes. Each float value indicates how strongly a gene is activated in a dataset.

           GeneA  GeneB  GeneC  ...
    DataA    1.5    2.5    3.5  ...
    DataB    5.5    6.5    7.5  ...
    DataC    8.5    8.5    8.5  ...
    ...

I want to build a result table (DataFrame or CSV file) like the one below, with two value columns because the scipy.stats.pearsonr function returns (PCC, p-value). In my example, XX and YY are the results of pearsonr([1.5, 5.5, 8.5], [2.5, 6.5, 8.5]). Similarly, ZZ and AA are the results of pearsonr([1.5, 5.5, 8.5], [3.5, 7.5, 8.5]). I don't need redundant pairs such as GeneB_GeneA or GeneC_GeneB.

                 PCC  P-value
    GeneA_GeneB   XX       YY
    GeneA_GeneC   ZZ       AA
    GeneB_GeneC   BB       CC
    ...

Since there are many columns and rows (over 100) and their names are complex, referring to columns or rows by name would be difficult.

This may be a simple problem for experts, but I don't know how to handle this kind of table with Python and the pandas library. In particular, creating a new DataFrame and adding the results to it seems very complex.

Sorry for my poor explanation, but I hope someone can help me.

+8

python pandas correlation

z991 Nov 30 '15 at 11:39


3 answers

To get the pairs, treat this as a combinations problem. You can then use concat to collect all rows into a single result DataFrame.

    import pandas
    from itertools import combinations

    df = pandas.read_csv('gene.csv')

    # get the column names as a list; these are the gene names
    column_list = df.columns.values.tolist()

    result = []
    for c in combinations(column_list, 2):
        firstGene, secondGene = c
        firstGeneData = df[firstGene].tolist()
        secondGeneData = df[secondGene].tolist()

        # now get the PCC and p-value using scipy
        pcc = ...
        p_value = ...

        # one single-row DataFrame per pair, labelled "<first>_<second>"
        result.append(pandas.DataFrame([{'PCC': pcc, 'P-value': p_value}],
                                       index=[str(firstGene) + '_' + str(secondGene)],
                                       columns=['PCC', 'P-value']))

    result_df = pandas.concat(result)
    # result_df.to_csv(...)
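One way to fill in the elided PCC / p-value step is scipy.stats.pearsonr, which the question already mentions. The following is only a minimal sketch under assumptions: the small DataFrame is made-up sample data standing in for gene.csv (the DataD row is invented), and the rows are collected in a dictionary rather than with concat.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

# Hypothetical sample data standing in for the contents of gene.csv
df = pd.DataFrame(
    {
        "GeneA": [1.5, 5.5, 8.5, 2.0],
        "GeneB": [2.5, 6.5, 8.5, 1.0],
        "GeneC": [3.5, 7.5, 8.5, 9.0],
    },
    index=["DataA", "DataB", "DataC", "DataD"],
)

rows = {}
for first_gene, second_gene in combinations(df.columns, 2):
    # pearsonr returns the correlation coefficient and its p-value
    pcc, p_value = pearsonr(df[first_gene], df[second_gene])
    rows[f"{first_gene}_{second_gene}"] = {"PCC": pcc, "P-value": p_value}

# One row per unordered pair, one column per statistic
result_df = pd.DataFrame.from_dict(rows, orient="index")
print(result_df)
```

Because combinations yields each unordered pair once, the redundant GeneB_GeneA style rows never appear.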

+2

chenzhongpu Nov 30 '15 at 14:40


A simple solution is to use the pairwise_corr function of the Pingouin package (which I created):

    import pingouin as pg

    pg.pairwise_corr(data, method='pearson')

This gives you a DataFrame with all the combinations of columns and, for each of them, the r value, p-value, sample size, and more.

There are also a number of options for specifying one or more columns (for example, one-to-all behavior), as well as covariates for partial correlation and various methods for calculating the correlation coefficient. Please see this Jupyter Notebook example for a more detailed demonstration.

0

Raphael Jul 13 '19 at 20:55


Stefan · Accepted Answer · 2015-11-30T14:49:37+0000

    from pandas import *
    import numpy as np
    from scipy.stats import pearsonr
    import itertools

Creating random sample data:

    df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)])
    print(df)

         gene_a    gene_b    gene_c    gene_d    gene_e
    0  0.471257  0.854139  0.781204  0.678567  0.697993
    1  0.292909  0.046159  0.250902  0.064004  0.307537
    2  0.422265  0.646988  0.084983  0.822375  0.713397
    3  0.113963  0.016122  0.227566  0.206324  0.792048
    4  0.357331  0.980479  0.157124  0.560889  0.973161

    correlations = {}
    columns = df.columns.tolist()

    for col_a, col_b in itertools.combinations(columns, 2):
        correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])

    result = DataFrame.from_dict(correlations, orient='index')
    result.columns = ['PCC', 'p-value']

    print(result.sort_index())

                         PCC   p-value
    gene_a__gene_b  0.461357  0.434142
    gene_a__gene_c  0.177936  0.774646
    gene_a__gene_d -0.854884  0.064896
    gene_a__gene_e -0.155440  0.802887
    gene_b__gene_c -0.575056  0.310455
    gene_b__gene_d -0.097054  0.876621
    gene_b__gene_e  0.061175  0.922159
    gene_c__gene_d -0.633302  0.251381
    gene_c__gene_e -0.771120  0.126836
    gene_d__gene_e  0.531805  0.356315

Get the unique DataFrame column combinations using itertools.combinations(iterable, r)
Iterate through these combinations and calculate the pairwise correlations using scipy.stats.pearsonr
Add each result (a (PCC, p-value) tuple) to a dictionary
Build a DataFrame from the dictionary

You can then save the result with result.to_csv(). It might be more convenient to use a MultiIndex (two index levels containing the names of each column) instead of the concatenated names for each pair.
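The MultiIndex idea could look roughly like the sketch below. It is an illustration, not part of the original answer: the level names gene_1 and gene_2, the fixed random seed, and the `import pandas as pd` alias are all assumptions made here.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Random sample data in the same shape as above; the seed is only for repeatability
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 5)), columns=["gene_" + c for c in "abcde"])

# Key each result by a (col_a, col_b) tuple instead of a joined string
correlations = {}
for col_a, col_b in itertools.combinations(df.columns, 2):
    r, p = pearsonr(df[col_a], df[col_b])
    correlations[(col_a, col_b)] = (r, p)

# gene_1 / gene_2 are hypothetical level names chosen for this sketch
index = pd.MultiIndex.from_tuples(list(correlations), names=["gene_1", "gene_2"])
result = pd.DataFrame(list(correlations.values()), index=index, columns=["PCC", "p-value"])

# Look up a single pair, or every pair whose first gene is gene_a
print(result.loc[("gene_a", "gene_b")])
print(result.loc["gene_a"])

# Without a path, to_csv returns the CSV text; pass a filename to write a file
csv_text = result.to_csv()
```

With two index levels, the gene names stay as separate fields in the CSV, so there is no need to split a joined label such as gene_a__gene_b later.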
