Attributes / Information Contained in DataFrame Column Names

I have some data imported from a CSV file. To create something similar, I used this:

    import pandas as pd

    data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]],
                        columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
    means = data.groupby(['split','sex']).mean()

The resulting data structure looks something like this:

               group0Low  group0High  group1Low  group1High
    split sex
    0     0            2           3          4           5
          1            2           3          4           5
    1     0            2           3          4           5
          1            2           3          4           5

You will notice that each column name actually encodes two variables (group number and height, i.e. Low or High). (The data was laid out this way to run repeated-measures ANOVA in SPSS.)

I want to split the columns up so that I can also group by "group", like this (I did not keep the order of the numbers consistent, but hopefully the idea is clear):

                     low  high
    split sex group
    0     0   0       95   265
              1      123    54
          1   0      120   220
              1       98   111
    1     0   0      150   190
              1      211   300
          1   0      139    86
              1      132   250

How do I achieve this?

2 answers

The first trick is to gather the columns into a single column using stack:

    In [6]: means
    Out[6]:
               group0Low  group0High  group1Low  group1High
    split sex
    0     0            2           3          4           5
          1            2           3          4           5
    1     0            2           3          4           5
          1            2           3          4           5

    In [13]: stacked = means.stack().reset_index(level=2)

    In [14]: stacked.columns = ['group_level', 'mean']

    In [15]: stacked.head(2)
    Out[15]:
              group_level  mean
    split sex
    0     0     group0Low     2
          0    group0High     3

Now we can perform any string operations that we want on group_level using pd.Series.str as follows:

    In [18]: stacked['group'] = stacked.group_level.str[:6]

    In [21]: stacked['level'] = stacked.group_level.str[6:]

    In [22]: stacked.head(2)
    Out[22]:
              group_level  mean   group level
    split sex
    0     0     group0Low     2  group0   Low
          0    group0High     3  group0  High

Now you are in business, and you can do whatever you want. For example, summarize each group / level:

    In [31]: stacked.groupby(['group', 'level']).sum()
    Out[31]:
                  mean
    group  level
    group0 High     12
           Low       8
    group1 High     20
           Low      16

How do I group everything?

If you want to group by split, sex, group and level, you can do:

    In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
    Out[113]:
                            mean
    split sex group  level
    0     0   group0 High      3
                     Low       2
              group1 High      5
                     Low       4
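To get from the stacked form to the layout asked for in the question (low and high as columns, with group in the index), one option is to move level back into the columns with unstack. This is only a sketch under a few assumptions: it rebuilds the example data so that it runs on its own, the names long_df and wide are made up for illustration, and the level labels are lower-cased to match the question's column names.

```python
import pandas as pd

# Rebuild the example data from the question.
data = pd.DataFrame(
    [[1, 0, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [1, 1, 2, 3, 4, 5], [0, 0, 2, 3, 4, 5]],
    columns=['split', 'sex', 'group0Low', 'group0High', 'group1Low', 'group1High'],
)
means = data.groupby(['split', 'sex']).mean()

# Long format: one row per (split, sex, original column name).
long_df = means.stack().rename('mean').reset_index()
long_df.columns = ['split', 'sex', 'group_level', 'mean']

# Split the combined label at position 6, as in the answer above.
long_df['group'] = long_df['group_level'].str[:6]
long_df['level'] = long_df['group_level'].str[6:].str.lower()

# Move 'level' into the columns; (split, sex, group) stays in the index.
wide = (
    long_df.set_index(['split', 'sex', 'group', 'level'])['mean']
           .unstack('level')
)
wide = wide[['low', 'high']]  # column order used in the question
print(wide)
```

From here, wide.groupby(level='group') groups by the group label alone.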

What if the split is not always at location 6?

This SO answer shows how to make the splitting smarter.
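As one possible way to avoid the fixed position, a regular expression with Series.str.extract can pull the two parts out regardless of how many digits the group number has. This is a sketch, not the approach from the linked answer; the labels below (including group12) are hypothetical, and the pattern assumes every name has the form group<digits> followed by Low or High.

```python
import pandas as pd

# Hypothetical labels; the group number may have more than one digit.
labels = pd.Series(['group0Low', 'group0High', 'group12Low', 'group12High'])

# Named capture groups become the column names of the resulting DataFrame.
parts = labels.str.extract(r'^(?P<group>group\d+)(?P<level>Low|High)$')
print(parts)
```

Labels that do not match the pattern yield NaN in both columns, which makes unexpected column names easy to spot.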


This can be done by first building a multi-level index for the column names and then reshaping the data with stack.

    import pandas as pd
    import numpy as np

    # some artificial data
    # ==================================
    multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
    np.random.seed(0)
    df = pd.DataFrame(np.random.randint(50,300, (4,4)),
                      columns='group0Low group0High group1Low group1High'.split(),
                      index=multi_index)
    df

               group0Low  group0High  group1Low  group1High
    split sex
    0     0          222          97        167         242
          1          117         245        153          59
    1     0          261          71        292          86
          1          137         120        266         138

    # processing
    # ==============================
    level_group = np.where(df.columns.str.contains('0'), 0, 1)
    # output: array([0, 0, 1, 1])
    level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
    # output: array(['low', 'high', 'low', 'high'], dtype='<U4')

    multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
    df.columns = multi_level_columns

    df.stack(level='group')

    val              high  low
    split sex group
    0     0   0        97  222
              1       242  167
          1   0       245  117
              1        59  153
    1     0   0        71  261
              1        86  292
          1   0       120  137
              1       138  266
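The str.contains('0') test above only distinguishes two groups, and a name such as group10Low would be classified as group 0. A possible generalization is to parse each column name with a regular expression and build the MultiIndex from the parsed tuples. This is a sketch under stated assumptions: the helper parse_column and the pattern are hypothetical, the pattern assumes names of the form group<digits>Low or group<digits>High, and the data values are arbitrary.

```python
import re

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 0, 1]], names=['split', 'sex'])
df = pd.DataFrame(
    np.arange(16).reshape(4, 4),  # arbitrary illustrative values
    columns=['group0Low', 'group0High', 'group1Low', 'group1High'],
    index=index,
)

# Assumed naming scheme: 'group' + digits + 'Low' or 'High'.
pattern = re.compile(r'^group(\d+)(Low|High)$')

def parse_column(name):
    """Return (group number, 'low' or 'high') for a column name."""
    match = pattern.match(name)
    if match is None:
        raise ValueError(f'unexpected column name: {name!r}')
    return int(match.group(1)), match.group(2).lower()

df.columns = pd.MultiIndex.from_tuples(
    [parse_column(c) for c in df.columns], names=['group', 'val']
)

# Move the 'group' level from the columns into the row index.
reshaped = df.stack(level='group')
print(reshaped)
```

Raising on unmatched names is a deliberate choice here, so that a stray column fails loudly instead of being silently assigned to a group.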