Pandas: .groupby().size() and percentages

I have a DataFrame that comes from the df.groupby().size() operation and looks like this:

    Localization                           RNA level
    cytoplasm                              1 Non-expressed     7
                                           2 Very low         13
                                           3 Low               8
                                           4 Medium            6
                                           5 Moderate          8
                                           6 High              2
                                           7 Very high         6
    cytoplasm & nucleus                    1 Non-expressed     5
                                           2 Very low          8
                                           3 Low               2
                                           4 Medium           10
                                           5 Moderate         16
                                           6 High              6
                                           7 Very high         5
    cytoplasm & nucleus & plasma membrane  1 Non-expressed     6
                                           2 Very low          3
                                           3 Low               3
                                           4 Medium            7
                                           5 Moderate          8
                                           6 High              4
                                           7 Very high         1

What I want to do is compute the individual occurrences (i.e. the last column, coming from .size()) as a percentage of the total number of occurrences in the corresponding Localization.

For example: in the cytoplasm Localization there are 50 cases in total (7 + 13 + 8 + 6 + 8 + 2 + 6), which gives 14% and 26% for the Non-expressed and Very low RNA levels, respectively.
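As a quick sanity check of that arithmetic, here is a minimal plain-Python sketch. The dictionary just restates the cytoplasm counts from the table above; the variable names are only illustrative.

```python
# Counts for the cytoplasm Localization, copied from the table above.
cytoplasm = {
    "1 Non-expressed": 7,
    "2 Very low": 13,
    "3 Low": 8,
    "4 Medium": 6,
    "5 Moderate": 8,
    "6 High": 2,
    "7 Very high": 6,
}

# Total number of cases in this Localization.
total = sum(cytoplasm.values())

# Each count expressed as a percentage of the group total.
percent = {level: 100 * n / total for level, n in cytoplasm.items()}

print(total)                       # 50
print(percent["1 Non-expressed"])  # 14.0
print(percent["2 Very low"])       # 26.0
```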

Is there a good way to do this? I worked around it in what I think is a rather clumsy way: creating a new DataFrame for each Localization and working from there. But there are a lot of rows, and all the resulting DataFrames then have to be merged back together at the end. I hope there is a smarter way to do this!

python pandas bioinformatics
1 answer

Here is a complete example based on pandas groupby and sum. The basic idea is to group the data by 'Localization' and use transform to relate each Size to the total of its own group.

    import pandas as pd
    from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

    data = """Localization,RNA level,Size
    cytoplasm ,1 Non-expressed, 7
    cytoplasm ,2 Very low ,13
    cytoplasm ,3 Low , 8
    cytoplasm ,4 Medium , 6
    cytoplasm ,5 Moderate , 8
    cytoplasm ,6 High , 2
    cytoplasm ,7 Very high , 6
    cytoplasm & nucleus ,1 Non-expressed, 5
    cytoplasm & nucleus ,2 Very low , 8
    cytoplasm & nucleus ,3 Low , 2
    cytoplasm & nucleus ,4 Medium ,10
    cytoplasm & nucleus ,5 Moderate ,16
    cytoplasm & nucleus ,6 High , 6
    cytoplasm & nucleus ,7 Very high , 5
    cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
    cytoplasm & nucleus & plasma membrane,2 Very low , 3
    cytoplasm & nucleus & plasma membrane,3 Low , 3
    cytoplasm & nucleus & plasma membrane,4 Medium , 7
    cytoplasm & nucleus & plasma membrane,5 Moderate , 8
    cytoplasm & nucleus & plasma membrane,6 High , 4
    cytoplasm & nucleus & plasma membrane,7 Very high , 1"""

    # Create the dataframe
    df = pd.read_csv(StringIO(data))

    # str.strip() and astype() return new objects, so assign the results back
    df['Localization'] = df['Localization'].str.strip()
    df['RNA level'] = df['RNA level'].str.strip()
    df['Size'] = df['Size'].astype(int)

    # Each Size as a percentage (0-100) of the total for its Localization
    df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: 100 * x / x.sum())
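If you already have the output of .groupby().size() (a Series indexed by Localization and RNA level), you may not need to rebuild a DataFrame from text at all. The following is a minimal sketch of that variant, not part of the original answer: the `raw` data is made up for illustration, and it assumes the index level is actually named 'Localization'.

```python
import pandas as pd

# Hypothetical raw observations, one row per case (illustrative data only).
raw = pd.DataFrame({
    "Localization": ["cytoplasm"] * 5 + ["nucleus"] * 4,
    "RNA level": [
        "1 Non-expressed", "1 Non-expressed",
        "2 Very low", "2 Very low", "2 Very low",
        "1 Non-expressed",
        "2 Very low", "2 Very low", "2 Very low",
    ],
})

# Series with a (Localization, RNA level) MultiIndex, like the one in the question.
sizes = raw.groupby(["Localization", "RNA level"]).size()

# Total per Localization, broadcast back to the same index as `sizes`.
totals = sizes.groupby(level="Localization").transform("sum")

# Each count as a percentage of its Localization total.
percent = 100 * sizes / totals

# Optional: flatten back into an ordinary DataFrame.
result = pd.DataFrame({"Size": sizes, "Percent": percent}).reset_index()
print(result)
```

Grouping on the index level keeps everything in one object, so there is nothing to split per Localization and merge back afterwards.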