I work with a large biological dataset.
I want to calculate the PCC (Pearson correlation coefficient) for every pair of columns in my data table and save the result as a DataFrame or CSV file.
The data table looks like this: the columns are gene names and the rows are dataset codes. Each float value indicates how strongly a gene is activated in a dataset.
        GeneA   GeneB   GeneC   ...
DataA   1.5     2.5     3.5     ...
DataB   5.5     6.5     7.5     ...
DataC   8.5     8.5     8.5     ...
...
As a result, I want to build a table (DataFrame or CSV file) like the one below. It has two value columns because the scipy.stats.pearsonr function returns (PCC, p-value). In my example, XX and YY are the results of pearsonr([1.5, 5.5, 8.5], [2.5, 6.5, 8.5]). Similarly, ZZ and AA are the results of pearsonr([1.5, 5.5, 8.5], [3.5, 7.5, 8.5]). I don't need redundant pairs like GeneB_GeneA or GeneC_GeneB in my result.
              PCC   P-value
GeneA_GeneB   XX    YY
GeneA_GeneC   ZZ    AA
GeneB_GeneC   BB    CC
...
Since there are many columns and rows (over 100) and their names are complex, typing out column or row names by hand will be difficult.
This may be a simple problem for experts, but I don't know how to handle such a table with Python and the pandas library. In particular, creating a new DataFrame and adding the results to it seems very complex to me.
Sorry for my poor explanation, but I hope someone can help me.
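One possible approach to the task described above, offered as a minimal sketch rather than a definitive solution: loop over every unordered pair of columns with itertools.combinations, call scipy.stats.pearsonr on each pair, and collect the results into a new DataFrame. The toy DataFrame below is copied from the example table in the question. The names df, records and result are placeholders chosen for illustration, and the sketch assumes the real table is already loaded into a pandas DataFrame with genes as columns and no missing values.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

# Toy data copied from the example table in the question.
# In practice df would be loaded from the real file (assumption).
df = pd.DataFrame(
    {
        "GeneA": [1.5, 5.5, 8.5],
        "GeneB": [2.5, 6.5, 8.5],
        "GeneC": [3.5, 7.5, 8.5],
    },
    index=["DataA", "DataB", "DataC"],
)

records = []
# combinations() yields each unordered pair once, so GeneA_GeneB
# appears but the redundant GeneB_GeneA does not.
for col_a, col_b in combinations(df.columns, 2):
    pcc, p_value = pearsonr(df[col_a], df[col_b])
    records.append((f"{col_a}_{col_b}", pcc, p_value))

# Build the result table: one row per gene pair, indexed by pair name.
result = pd.DataFrame(records, columns=["pair", "PCC", "P-value"]).set_index("pair")
print(result)
```

Because the loop iterates over df.columns, no gene name has to be typed by hand. The table can then be written out with result.to_csv("pcc_results.csv"), where the file name is only an example. If only the coefficients were needed, df.corr() would return the full PCC matrix directly, but it does not provide p-values.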