Pandas: select multiple columns from one row

Question


I have a script that does what I need, but it is very inefficient. I asked for help on Code Review and was told to try Pandas instead. So I did, but I can hardly understand how it works. I tried reading the documentation and other questions here, but I cannot find the answer.

I have a dataframe with relatively few rows (from 20 to several hundred) and even fewer columns. I used the Pandas read_table function to load the source data from a .txt file that looks like this:

    [ID1, Gene1, Sequence1, Ratio1, Ratio2, Ratio3]
    [ID1, Gene1, Sequence2, Ratio1, Ratio2, Ratio3]
    [ID2, Gene2, Sequence3, Ratio1, Ratio2, Ratio3]
    [ID2, Gene3, Sequence4, Ratio1, Ratio2, Ratio3]
    [ID3, Gene3, Sequence5, Ratio1, Ratio2, Ratio3]

... along with a number of unimportant columns.

What I want to do is select all the ratios for each sequence and do some calculations and statistics on them (all 3 ratios for each sequence). I tried

    df.groupby('Sequence')
    for col in df:
        do something / print(col) / print(col[0])

... but that only confuses me. If I call print(col), I get some kind of df construct, whereas if I call print(col[0]), I get only the sequences. As far as I understand it, I should still have all the other columns and their data, since groupby() does not delete any data, it just groups the rows by some input column. What am I doing wrong?

Although I have not got that far yet because of the problems above, I also want my script to select all the ratios for each ID and perform the same calculations on them, but this time per ratio column (i.e. Ratio1 for all rows with ID1, the same for Ratio2, etc.). And finally, do the same for each gene.

EDIT:

So, let's say I want to perform this calculation for each ratio in a row, and then take the median of the three resulting values:

    df['Value1'] = spike[data['ID']] / float(data['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)
    df['Value2'] = spike[data['ID']] / float(data['Ratio 2']) * (10**-12) * (6.022*10**23) / (1*10**6)
    df['Value3'] = spike[data['ID']] / float(data['Ratio 3']) * (10**-12) * (6.022*10**23) / (1*10**6)

... where spike is a dictionary whose keys are the IDs. Ignoring the dict part, I can do the calculations (thanks!), but how do I access the dictionary using the IDs in the dataframe? With the code above, I just get the error "unhashable type: Series".

Here are some real data:

    ID  Gene  Sequence     Ratio1  Ratio2  Ratio3
    1   KRAS  SFEDXXYR     15.822  14.119  14.488
    2   KRAS  VEDAXXXLVR   9.8455  8.9279  16.911
    3   ELK4  IEXXXCESLNK  15.745  7.9122  9.5966
    3   ELK4  IEGXXXSLNKR  1.177   NaN     12.073


python pandas

Sajber Jan 13 '14 at 10:36


1 answer

joris · Accepted Answer · 2014-01-13T10:44:12+0000

df.groupby() does not change or group df in place. Therefore, you must assign the result to a new variable to use it later. For example:
```
 grouped = df.groupby('Sequence') 
```
By the way, in the given example data, all the data in the Sequence column is unique, therefore, grouping by this column will not do much.
In addition, you usually do not need to iterate over df as you do here. To apply a function to all groups, you can call it directly on the result of groupby, for example df.groupby().apply(..) or df.groupby().aggregate(..).
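As a minimal sketch of both points, the snippet below rebuilds the sample rows from the question by hand (the question reads them with read_table) and groups by Gene, since Sequence is unique here. Iterating over a groupby result yields (key, sub-dataframe) pairs, which is probably why print(col[0]) showed only the sequence; the variable names are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question; built by hand for illustration.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Gene": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "Sequence": ["SFEDXXYR", "VEDAXXXLVR", "IEXXXCESLNK", "IEGXXXSLNKR"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})

grouped = df.groupby("Gene")

# Iterating yields (key, sub-DataFrame) pairs; each sub-DataFrame
# still carries all the original columns.
group_sizes = {}
for gene, group in grouped:
    group_sizes[gene] = len(group)

# Usually no loop is needed: agg computes the statistics per group.
ratio1_stats = grouped["Ratio1"].agg(["mean", "median"])
print(ratio1_stats)
```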
Can you give a more concrete example of the function you want to apply to the ratios?
To calculate the median of the three ratios for each sequence (each row), you can do:
```
 df[['Ratio1', 'Ratio2', 'Ratio3']].median(axis=1) 
```
axis=1 means that the median is not taken down each column (across rows), but across the columns, giving one value per row.
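To make the difference between the two axes concrete, here is a small sketch on the ratio columns of the sample data (rebuilt by hand for illustration). Note that median skips NaN by default, so the last row's median is taken over its two available values.

```python
import pandas as pd

# Ratio columns of the sample data from the question.
df = pd.DataFrame({
    "Sequence": ["SFEDXXYR", "VEDAXXXLVR", "IEXXXCESLNK", "IEGXXXSLNKR"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# axis=1: one median per row (across the three ratio columns).
row_median = df[ratio_cols].median(axis=1)

# axis=0 (the default): one median per column (across all rows).
col_median = df[ratio_cols].median(axis=0)

print(row_median)
print(col_median)
```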

As another example, to calculate the median of all Ratio1 values for each ID, you can do:

 df.groupby('ID')['Ratio1'].median()

Here you group by ID, select the Ratio1 column and calculate the median value for each group.
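The same pattern should extend to all three ratio columns at once, and to grouping by gene, by selecting a list of columns instead of a single one. A sketch on the sample data (rebuilt by hand; per_id and per_gene are illustrative names):

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Gene": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# Median of each ratio column within each ID (NaN values are skipped).
per_id = df.groupby("ID")[ratio_cols].median()

# The same statistic, grouped by gene instead.
per_gene = df.groupby("Gene")[ratio_cols].median()

print(per_id)
print(per_gene)
```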

UPDATE: you should probably split the questions into separate ones, but as an answer to your new question:

data['ID'] gives you the whole ID column (a Series), so you cannot use it as a dictionary key. You need the single ID value for each row. To apply a function to each row of a dataframe, you can use apply:

    def my_func(row):
        return spike[row['ID']] / float(row['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)

    df['Value1'] = df.apply(my_func, axis=1)
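A possible alternative to the row-wise apply is to look up the dictionary for the whole column with Series.map and let pandas do the arithmetic column-wise. The sketch below makes two assumptions: the spike values are made up for illustration, and the columns are named Ratio1 to Ratio3 as in the sample data (adjust if your file uses 'Ratio 1' with a space).

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})

# Hypothetical spike amounts keyed by ID (not from the question).
spike = {1: 2.0, 2: 5.0, 3: 1.0}

# Constant part of the formula from the question.
FACTOR = (10**-12) * (6.022 * 10**23) / (1 * 10**6)

# Map each row's ID to its spike value: one dictionary lookup per row.
spike_per_row = df["ID"].map(spike)

# Column-wise arithmetic; NaN ratios simply produce NaN values.
for i in (1, 2, 3):
    df[f"Value{i}"] = spike_per_row / df[f"Ratio{i}"] * FACTOR

# Median of the three resulting values for each row.
df["ValueMedian"] = df[["Value1", "Value2", "Value3"]].median(axis=1)
print(df)
```

An ID that is missing from spike gives NaN with map rather than raising a KeyError, which may or may not be what you want.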
