I have a lot of data that I would like to structure in the Pandas framework. However, for this I need a multi-index format. The Pandas MultiIndex function has always confused me, and this time I can not get around it.
I built the structure the way I want it as a dict, but since my actual data is much larger, I want to use Pandas. Below is the dict code. Note that the source data contains a lot more labels and more lines.
The idea is that the source data contains task rows with the Task_n index that were executed by the participant with the Participant_n index. Each row is a segment. Despite the fact that the source data does not have this difference, I want to add this to my data file. In other words:
Participant_n | Task_n | val | dur ---------------------------------- 1 | 1 | 12 | 2 1 | 1 | 3 | 4 1 | 1 | 4 | 12 1 | 2 | 11 | 11 1 | 2 | 34 | 4
The above example contains one participant, two tasks, respectively three and two segments (rows).
In Python with a dict structure, it looks like this:
import pandas as pd cols = ['Participant_n', 'Task_n', 'val', 'dur'] data = [[1,1,25,83], [1,1,4,68], [1,1,9,987], [1,2,98,98], [1,2,84,4], [2,1,9,21], [2,2,15,6], [2,2,185,6], [2,2,18,4], [2,3,8,12], [3,1,7,78], [3,1,12,88], [3,2,12,48]] d = pd.DataFrame(data, columns=cols) part_d = {} for row in d.itertuples(): participant_n = row.Participant_n participant = "participant" + str(participant_n) task = "task" + str(row.Task_n) if participant in part_d: part_d[participant]['all_sum']['val'] += int(row.val) part_d[participant]['all_sum']['dur'] += int(row.dur) else: part_d[participant] = { 'prof': 0 if participant_n < 20 else 1, 'all_sum': { 'val': int(row.val), 'dur': int(row.dur), } } if task in part_d[participant]: # Get already existing keys k = list(part_d[participant][task].keys()) k_int = [] # Only get the ints (ie not all_sum etc.) for n in k: # Get digit from eg seg1 n = n[3:] try: k_int.append(int(n)) except ValueError: pass # Increment max by 1 i = max(k_int) + 1 part_d[participant][task][f"seg{i}"] = { 'val': int(row.val), 'dur': int(row.dur), } part_d[participant][task]['task_sum']['val'] += int(row.val) part_d[participant][task]['task_sum']['dur'] += int(row.dur) else: part_d[participant][task] = { 'seg1': { 'val': int(row.val), 'dur': int(row.dur), }, 'task_sum': { 'val': int(row.val), 'dur': int(row.dur), } } print(part_d)
Ultimately, here I have additional variables, such as: task_sum (the sum of the participant’s task), all_sum (the sum of all actions of the participant), as well as prof , which is an arbitrary Boolean flag. The resulting dict looks like this (not decorated to save space. If you want to check, open a JSON or Python dict in a text editor and decorate):
{'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}
Instead of a dictionary, I would like it to end in pd.DataFrame with several indexes that look like the one below or similar. (For simplicity, instead of task1 or seg1 I just used indexes.)
Participant Prof all_sum Task Task_sum Seg val dur val dur val dur ==================================================================== participant1 0 220 1240 1 38 1138 1 25 83 2 4 68 3 9 987 2 182 102 1 98 98 2 84 4 -------------------------------------------------------------------- participant2 0 235 49 1 9 21 1 9 21 2 218 16 1 15 6 2 185 6 3 18 4 3 8 12 1 8 12 -------------------------------------------------------------------- participant3 0 31 214 1 19 166 1 7 78 2 12 88 2 12 48 1 12 48
Is this a structure that is possible in Pandas? If not, what are some reasonable alternatives?
Again, I must emphasize that in fact there is much more data and possibly more sublevels. Therefore, the solution must be flexible and efficient. If this simplifies the situation, I am ready to have only a multi-index on one axis and change the title to:
Participant Prof all_sum_val all_sum_dur Task Task_sum_val Task_sum_dur Seg
The main problem that I encountered is that I do not understand how I can create a multi-index df if I do not know the sizes in advance. I do not know in advance how many tasks or segments there will be. So I'm sure I can save the loop construct from my original dict approach, and I guess I would have to add / concat to the original empty DataFrame, but the question is what the structure should be. This cannot be a simple series because it does not take into account the multi-index. So how?
For people who have read this far and want to try it myself, I think that my original code can be reused for the most part (loop and variable assignment), but instead of dict it should be accessories to the DataFrame. This is an import aspect: data should be easy to read with getters / setters, just like a regular DataFrame. For example. It should be easy to get a duration value for two participants, task 2, segment 2, etc. But also, getting a subset of the data (e.g. where prof === 0 ) should be no problem.