Convert dict constructor to Pandas DataFrame MultiIndex

Question

Convert dict constructor to Pandas DataFrame MultiIndex

I have a lot of data that I would like to structure in the Pandas framework. However, for this I need a multi-index format. The Pandas MultiIndex function has always confused me, and this time I can not get around it.

I built the structure the way I want it as a dict, but since my actual data is much larger, I want to use Pandas. Below is the dict code. Note that the source data contains a lot more labels and more lines.

The idea is that the source data contains task rows with the Task_n index that were executed by the participant with the Participant_n index. Each row is a segment. Despite the fact that the source data does not have this difference, I want to add this to my data file. In other words:

 Participant_n | Task_n | val | dur ---------------------------------- 1 | 1 | 12 | 2 1 | 1 | 3 | 4 1 | 1 | 4 | 12 1 | 2 | 11 | 11 1 | 2 | 34 | 4

The above example contains one participant, two tasks, respectively three and two segments (rows).

In Python with a dict structure, it looks like this:

 import pandas as pd cols = ['Participant_n', 'Task_n', 'val', 'dur'] data = [[1,1,25,83], [1,1,4,68], [1,1,9,987], [1,2,98,98], [1,2,84,4], [2,1,9,21], [2,2,15,6], [2,2,185,6], [2,2,18,4], [2,3,8,12], [3,1,7,78], [3,1,12,88], [3,2,12,48]] d = pd.DataFrame(data, columns=cols) part_d = {} for row in d.itertuples(): participant_n = row.Participant_n participant = "participant" + str(participant_n) task = "task" + str(row.Task_n) if participant in part_d: part_d[participant]['all_sum']['val'] += int(row.val) part_d[participant]['all_sum']['dur'] += int(row.dur) else: part_d[participant] = { 'prof': 0 if participant_n < 20 else 1, 'all_sum': { 'val': int(row.val), 'dur': int(row.dur), } } if task in part_d[participant]: # Get already existing keys k = list(part_d[participant][task].keys()) k_int = [] # Only get the ints (ie not all_sum etc.) for n in k: # Get digit from eg seg1 n = n[3:] try: k_int.append(int(n)) except ValueError: pass # Increment max by 1 i = max(k_int) + 1 part_d[participant][task][f"seg{i}"] = { 'val': int(row.val), 'dur': int(row.dur), } part_d[participant][task]['task_sum']['val'] += int(row.val) part_d[participant][task]['task_sum']['dur'] += int(row.dur) else: part_d[participant][task] = { 'seg1': { 'val': int(row.val), 'dur': int(row.dur), }, 'task_sum': { 'val': int(row.val), 'dur': int(row.dur), } } print(part_d)

Ultimately, here I have additional variables, such as: task_sum (the sum of the participant’s task), all_sum (the sum of all actions of the participant), as well as prof , which is an arbitrary Boolean flag. The resulting dict looks like this (not decorated to save space. If you want to check, open a JSON or Python dict in a text editor and decorate):

 {'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}

Instead of a dictionary, I would like it to end in pd.DataFrame with several indexes that look like the one below or similar. (For simplicity, instead of task1 or seg1 I just used indexes.)

 Participant Prof all_sum Task Task_sum Seg val dur val dur val dur ==================================================================== participant1 0 220 1240 1 38 1138 1 25 83 2 4 68 3 9 987 2 182 102 1 98 98 2 84 4 -------------------------------------------------------------------- participant2 0 235 49 1 9 21 1 9 21 2 218 16 1 15 6 2 185 6 3 18 4 3 8 12 1 8 12 -------------------------------------------------------------------- participant3 0 31 214 1 19 166 1 7 78 2 12 88 2 12 48 1 12 48

Is this a structure that is possible in Pandas? If not, what are some reasonable alternatives?

Again, I must emphasize that in fact there is much more data and possibly more sublevels. Therefore, the solution must be flexible and efficient. If this simplifies the situation, I am ready to have only a multi-index on one axis and change the title to:

 Participant Prof all_sum_val all_sum_dur Task Task_sum_val Task_sum_dur Seg

The main problem that I encountered is that I do not understand how I can create a multi-index df if I do not know the sizes in advance. I do not know in advance how many tasks or segments there will be. So I'm sure I can save the loop construct from my original dict approach, and I guess I would have to add / concat to the original empty DataFrame, but the question is what the structure should be. This cannot be a simple series because it does not take into account the multi-index. So how?

For people who have read this far and want to try it myself, I think that my original code can be reused for the most part (loop and variable assignment), but instead of dict it should be accessories to the DataFrame. This is an import aspect: data should be easy to read with getters / setters, just like a regular DataFrame. For example. It should be easy to get a duration value for two participants, task 2, segment 2, etc. But also, getting a subset of the data (e.g. where prof === 0 ) should be no problem.

+8

python dictionary python-3.x pandas dataframe

Bram vanroy Mar 6 '18 at 13:12

source share

2 answers

Gabriel a · Answer 1 · 2018-03-10T21:51:40+0000

My only suggestion is to get rid of your entire vocabulary. All this code can be rewritten in Pandas without much effort. This is likely to accelerate the transformation process, but will take some time. To help you in this process, I rewrote the section that you provided. The rest is up to you.

 import pandas as pd cols = ['Participant_n', 'Task_n', 'val', 'dur'] data = [[1,1,25,83], [1,1,4,68], [1,1,9,987], [1,2,98,98], [1,2,84,4], [2,1,9,21], [2,2,15,6], [2,2,185,6], [2,2,18,4], [2,3,8,12], [3,1,7,78], [3,1,12,88], [3,2,12,48]] df = pd.DataFrame(data, columns=cols) df["Task Sum val"] = df.groupby(["Participant_n","Task_n"])["val"].transform("sum") df["Task Sum dur"] = df.groupby(["Participant_n","Task_n"])["dur"].transform("sum") df["seg"] =df.groupby(["Participant_n","Task_n"]).cumcount() + 1 df["All Sum val"] = df.groupby("Participant_n")["val"].transform("sum") df["All Sum dur"] = df.groupby("Participant_n")["dur"].transform("sum") df = df.set_index(["Participant_n","All Sum val","All Sum dur","Task_n","Task Sum val","Task Sum dur"])[["seg","val","dur"]] df = df.sort_index() df

Exit

  seg val dur Participant_n All Sum val All Sum dur Task_n Task Sum val Task Sum dur 1 220 1240 1 38 1138 1 25 83 1138 2 4 68 1138 3 9 987 2 182 102 1 98 98 102 2 84 4 2 235 49 1 9 21 1 9 21 2 218 16 1 15 6 16 2 185 6 16 3 18 4 3 8 12 1 8 12 3 31 214 1 19 166 1 7 78 166 2 12 88 2 12 48 1 12 48

Try running this code and let me know what you think. Comment with any questions.

Haleemur ali · Answer 2 · 2018-03-15T04:53:42+0000

I ran into a similar problem with data presentation and came up with the following helper functions for groupby with subtotals.

Using this process, you can create subtotals for an arbitrary number of groups by columns, but the output is in a different format. Instead of subtotals placed in their own columns, each subtotal adds an additional row to the data frame.

For interactive research and data analysis, I find this very useful, as you can get subtotals with just a few lines of code

 def get_subtotals(frame, columns, aggvalues, subtotal_level): if subtotal_level == 0: return frame.groupby(columns, as_index=False).agg(aggvalues) elif subtotal_level == len(columns): return pd.DataFrame(frame.agg(aggvalues)).transpose().assign( **{c: np.nan for i, c in enumerate(columns)} ) return frame.groupby( columns[:subtotal_level], as_index=False ).agg(aggvalues).assign( **{c: np.nan for i, c in enumerate(columns[subtotal_level:])} ) def groupby_with_subtotals(frame, columns, aggvalues, grand_totals=False, totals_position='last'): gt = 1 if grand_totals else 0 out = pd.concat( [get_subtotals(df, columns, aggvalues, i) for i in range(len(columns)+gt)] ).sort_values(columns, na_position=totals_position) out[columns] = out[columns].fillna('total') return out.set_index(columns)

recover data creation code from Gabriel Answer

 cols = ['Participant_n', 'Task_n', 'val', 'dur'] data = [[1,1,25,83], [1,1,4,68], [1,1,9,987], [1,2,98,98], [1,2,84,4], [2,1,9,21], [2,2,15,6], [2,2,185,6], [2,2,18,4], [2,3,8,12], [3,1,7,78], [3,1,12,88], [3,2,12,48]] df = pd.DataFrame(data, columns=cols)

You must add the seg column first

 df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1

Then we can use groupby_with_subtotals as follows. Also, note that you can place subtotals on top, and also include grand_totals by going to grand_totals=True, totals_position='first'

 groupby_columns = ['Participant_n', 'Task_n', 'seg'] groupby_aggs = {'val': 'sum', 'dur': 'sum'} aggdf = groupby_with_subtotals(df, groupby_columns, groupby_aggs) aggdf # outputs dur val Participant_n Task_n seg 1 1.0 1.0 83 25 2.0 68 4 3.0 987 9 total 1138 38 2.0 1.0 98 98 2.0 4 84 total 102 182 total total 1240 220 2 1.0 1.0 21 9 total 21 9 2.0 1.0 6 15 2.0 6 185 3.0 4 18 total 16 218 3.0 1.0 12 8 total 12 8 total total 49 235 3 1.0 1.0 78 7 2.0 88 12 total 166 19 2.0 1.0 48 12 total 48 12 total total 214 31

Here the subtotal rows are marked total , and the leftmost total indicates the subtotal level.

Once a cumulative data frame is created, it can be accessed with intermediate results using loc . Example:

 aggdf.loc[1,'total','total'] # outputs: dur 1240 val 220 Name: (1, total, total), dtype: int64

Convert dict constructor to Pandas DataFrame MultiIndex

More articles: