I am struggling with pandas regarding how to group multiple column values with conditions:
This is what my data looks like as a pandas DataFrame:
id trigger timestamp
1 started 2017-10-01 14:00:1
1 ended 2017-10-04 12:00:1
2 started 2017-10-02 10:00:1
1 started 2017-10-03 11:00:1
2 ended 2017-10-04 12:00:1
2 started 2017-10-05 15:00:1
1 ended 2017-10-05 16:00:1
2 ended 2017-10-05 17:00:1
My goal is to find the difference in days, hours, or minutes between each 'started' timestamp and its matching 'ended' timestamp, grouped by id.
My output should look like this (diff in hours):
id trigger timestamp trigger timestamp diff
1 started 2017-10-01 14:00:1 ended 2017-10-04 12:00:1 70
1 started 2017-10-03 11:00:1 ended 2017-10-05 16:00:1 53
2 started 2017-10-02 10:00:1 ended 2017-10-04 12:00:1 26
2 started 2017-10-05 15:00:1 ended 2017-10-05 17:00:1 2
I have tried many options, but I cannot find an efficient solution.
Here is my code so far:
First I tried to split the data into 'started' and 'ended':
df['started'] = df.groupby(['id', 'timestamp'])['trigger'] == 'started'
df['ended'] = df.groupby(['id', 'timestamp'])['trigger'] == 'ended'
and then:
df.groupby(['id', 'started', 'ended'], as_index=True).sum()
but it does not work. I also tried:
df['started'] = df.groupby(['trigger'])['timestamp'].np.where(df['trigger']=='started')
also without any useful result.
How can I do this in pandas?
I also tried df.fillna(method='ffill'), but I still end up with NaN values.
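For reference, the pairing described above (the n-th 'started' of an id matched with the n-th 'ended' of the same id) could be sketched roughly as follows. This is only one possible approach, not necessarily the most efficient one. The helper column `pair` and the `_start` / `_end` suffixes are names made up for illustration, and the seconds are dropped from the sample timestamps as an assumption. With the timestamps as posted, the third pair works out to 50 hours rather than the 26 shown in the expected output.

```python
import pandas as pd

# Sample data from the question. Seconds are dropped here (assumption);
# this does not change the differences in hours.
df = pd.DataFrame({
    "id": [1, 1, 2, 1, 2, 2, 1, 2],
    "trigger": ["started", "ended", "started", "started",
                "ended", "started", "ended", "ended"],
    "timestamp": pd.to_datetime([
        "2017-10-01 14:00", "2017-10-04 12:00", "2017-10-02 10:00",
        "2017-10-03 11:00", "2017-10-04 12:00", "2017-10-05 15:00",
        "2017-10-05 16:00", "2017-10-05 17:00",
    ]),
})

# Number each event within its (id, trigger) group in row order:
# the first 'started' of an id gets 0, the second gets 1, and the
# same applies to 'ended'. 'pair' is a hypothetical helper column.
df["pair"] = df.groupby(["id", "trigger"]).cumcount()

started = df[df["trigger"] == "started"]
ended = df[df["trigger"] == "ended"]

# Match the n-th 'started' of each id with the n-th 'ended' of that id.
result = started.merge(ended, on=["id", "pair"], suffixes=("_start", "_end"))

# Elapsed time between start and end, in hours.
result["diff"] = (
    result["timestamp_end"] - result["timestamp_start"]
).dt.total_seconds() / 3600

result = (
    result.sort_values(["id", "pair"])
    .drop(columns="pair")
    .reset_index(drop=True)
)
print(result)
```

This relies on the rows appearing in the order in which the events should be paired. If that is not guaranteed, sorting by timestamp before calling `cumcount` would be needed, and the pairing rule itself may have to change.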