I have a script that does what I need, but it is very inefficient. I asked for help on Code Review and was told to try Pandas instead. So I did, but I can hardly understand how it works. I have read the documentation and other questions here, but I cannot find the answer.
So, I have a dataframe with few rows (from 20 to several hundred) and fewer columns. I used the Pandas read_table function to load the source data from a .txt file that looks like this:
[ID1, Gene1, Sequence1, Ratio1, Ratio2, Ratio3]
[ID1, Gene1, Sequence2, Ratio1, Ratio2, Ratio3]
[ID2, Gene2, Sequence3, Ratio1, Ratio2, Ratio3]
[ID2, Gene3, Sequence4, Ratio1, Ratio2, Ratio3]
[ID3, Gene3, Sequence5, Ratio1, Ratio2, Ratio3]
... along with a number of unimportant columns.
What I want to do is select all the ratios for each sequence and run some calculations and statistics on them (all 3 ratios for each sequence). I tried
    df.groupby('Sequence')
    for col in df:
        do something / print(col) / print(col[0])
... but it only confuses me. If I call print(col), I get some kind of df construct, whereas if I call print(col[0]), I get only the sequences. As far as I understand, I should still have all the other columns and their data, since groupby() does not delete any data, it just groups the rows by the given column. What am I doing wrong?
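For reference, here is a minimal sketch of what iterating over a groupby object appears to yield. The frame below is a made-up placeholder that only reuses the column names from above, and `ratio_cols`, `grouped` and `row_medians` are names invented for this illustration. Each item of the iteration is a (key, sub-DataFrame) tuple, which would explain why print(col[0]) shows only the sequence:

```python
import pandas as pd

# Placeholder data (not the real file); only the column names match the post.
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Gene": ["Gene1", "Gene1", "Gene2"],
    "Sequence": ["SeqA", "SeqB", "SeqC"],
    "Ratio1": [1.0, 2.0, 3.0],
    "Ratio2": [4.0, 5.0, 6.0],
    "Ratio3": [7.0, 8.0, 9.0],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# groupby() returns a new object and leaves df unchanged,
# so the result has to be kept and iterated over.
grouped = df.groupby("Sequence")

row_medians = {}
for name, group in grouped:
    # name is the group key (the sequence);
    # group is a DataFrame that still holds every column for that key.
    row_medians[name] = group[ratio_cols].median(axis=1).iloc[0]

print(row_medians)
```

Unpacking the tuple as `name, group` in the loop header gives direct access to the full sub-frame, so the other columns are not lost.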
Although I have not gotten that far yet because of the problems above, I also want my script to select all the ratios for each ID and perform the same calculations on them, but this time per ratio column (i.e. Ratio1 across all rows with ID1, the same for Ratio2, and so on). And finally, do the same for each gene.
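A minimal sketch of the per-ID and per-gene case, assuming the statistics of interest are the mean and median (the post does not say which statistics are needed, so these are stand-ins). It uses the sample rows given at the end of this post; `per_id` and `per_gene` are names invented for this illustration:

```python
import numpy as np
import pandas as pd

# Sample rows copied from the real data at the end of the post.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Gene": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "Sequence": ["SFEDXXYR", "VEDAXXXLVR", "IEXXXCESLNK", "IEGXXXSLNKR"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, np.nan],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# One row per ID, with mean and median computed separately for each ratio column.
# NaN values are skipped by default.
per_id = df.groupby("ID")[ratio_cols].agg(["mean", "median"])

# Same idea per gene, here with the median only.
per_gene = df.groupby("Gene")[ratio_cols].median()

print(per_id)
print(per_gene)
```

Selecting the ratio columns before aggregating keeps the non-numeric columns (Gene, Sequence) out of the calculation.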
EDIT:
So, let's say I want to perform this calculation for each ratio in a row, and then take the median of the three resulting values:
    df['Value1'] = spike[data['ID']] / float(data['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)
    df['Value2'] = spike[data['ID']] / float(data['Ratio 2']) * (10**-12) * (6.022*10**23) / (1*10**6)
    df['Value3'] = spike[data['ID']] / float(data['Ratio 3']) * (10**-12) * (6.022*10**23) / (1*10**6)
... where spike is a dictionary whose keys are the IDs. Ignoring the dict part, I can do the calculations (thanks!), but how do I look up values in the dictionary using the IDs from the dataframe? With the above code, I just get the message "Impossible type: series."
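One possible approach, sketched under assumptions: `Series.map` accepts a dict and looks up each ID row by row, which avoids indexing the dict with a whole Series. The spike amounts below are invented placeholders, the column names follow the sample data (Ratio1 without a space), and `factor`, `spike_per_row` and `values` are names made up for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical spike amounts keyed by ID (placeholder numbers).
spike = {1: 10.0, 2: 20.0, 3: 30.0}

# Sample rows copied from the real data at the end of the post.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, np.nan],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# Constant part of the formula: (10**-12) * (6.022*10**23) / (1*10**6)
factor = 1e-12 * 6.022e23 / 1e6

# Look up the spike amount for every row from its ID.
spike_per_row = df["ID"].map(spike)

# rdiv computes spike_per_row / ratio for all three columns, aligned by row.
values = df[ratio_cols].rdiv(spike_per_row, axis=0) * factor
values.columns = ["Value1", "Value2", "Value3"]

# Median of the three values per row; NaN entries are skipped by default.
df["Median"] = values.median(axis=1)
print(df)
```

IDs missing from the dict would map to NaN rather than raise an error, so a check with `spike_per_row.isna()` may be worth adding.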
Here are some real data:
ID  Gene  Sequence     Ratio1  Ratio2  Ratio3
1   KRAS  SFEDXXYR     15.822  14.119  14.488
2   KRAS  VEDAXXXLVR   9.8455  8.9279  16.911
3   ELK4  IEXXXCESLNK  15.745  7.9122  9.5966
3   ELK4  IEGXXXSLNKR  1.177   NaN     12.073