Calculate state duration using a pandas DataFrame

I am trying to calculate how often a state is entered and how long it lasts. For example, I have three possible states, 1, 2 and 3, and the current state is recorded in a pandas DataFrame:

test = pd.DataFrame([2,2,2,1,1,1,2,2,2,3,2,2,1,1], index=pd.date_range('00:00', freq='1h', periods=14)) 

For example, state 1 is entered twice (at index 3 and 12): the first time it lasts three hours, the second time two hours (so 2.5 hours on average). State 2 is entered three times, lasting 2.67 hours on average.

I know that I can mask data that I'm not interested in, for example, to analyze state 1:

 state1 = test.mask(test!=1) 

but from there I can’t find a way to continue.

1 answer

Hope the comments provide enough explanation. The key point is that you can use a custom rolling apply function and then cumsum to group the rows into "clumps" of the same state.
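To see the clump-labelling idea on its own, here is a minimal sketch. It flags state changes by comparing the series with a shifted copy of itself, which is a substitute for the rolling apply and not the method used in the answer, and then takes the cumulative sum of the flags. The names `states`, `is_first` and `clump_id` are just illustrative.

```python
import pandas as pd

# the sample state sequence from the question, without the time index
states = pd.Series([2, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1])

# True wherever the state differs from the previous row
# (the first row compares against NaN, so it counts as a change)
is_first = states != states.shift()

# running total of the change flags: rows in the same clump share one id
clump_id = is_first.cumsum()

print(clump_id.tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 6, 6]
```

Every uninterrupted run of the same state ends up with its own integer, so a later groupby on that integer treats each run as one unit.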

    import pandas as pd

    # set things up
    freq = "1h"
    df = pd.DataFrame(
        [2,2,2,1,1,1,2,2,2,3,2,2,1,1],
        index=pd.date_range('00:00', freq=freq, periods=14)
    )

    # add a column flagging rows whose state differs from the row before it
    # (the very first row has no predecessor, so fillna marks it as a start)
    df["is_first"] = pd.rolling_apply(df, 2, lambda x: x[0] != x[1]).fillna(1)

    # the cumulative sum - each "clump" gets its own integer id
    df["value_group"] = df["is_first"].cumsum()

    # get the rows corresponding to states beginning
    start = df.groupby("value_group", as_index=False).nth(0)
    # get the rows corresponding to states ending
    end = df.groupby("value_group", as_index=False).nth(-1)

    # put the timestamp indexes of the "first" and "last" state measurements into
    # their own data frame
    start_end = pd.DataFrame(
        {
            "start": start.index,
            # add freq to get when the state ended
            "end": end.index + pd.Timedelta(freq),
            "value": start[0]
        }
    )

    # convert timedeltas (nanoseconds) to seconds (float)
    start_end["duration"] = (
        (start_end["end"] - start_end["start"]).apply(float) / 1e9
    )

    # get average state length and counts
    agg = start_end.groupby("value").agg(["mean", "count"])["duration"]
    # convert the mean from seconds to hours
    agg["mean"] = agg["mean"] / (60 * 60)

And the output:

               mean  count
    value
    1      2.500000      2
    2      2.666667      3
    3      1.000000      1
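The top-level `pd.rolling_apply` function is no longer available in recent pandas releases. As a hedged sketch rather than the answerer's own code, the same table can be built with a shift comparison, `cumsum` and named aggregation. The column name `state` and the helper names `run_id`, `runs`, `n_rows` and `hours` are assumptions introduced for this example, and it assumes the samples are evenly spaced at `freq`.

```python
import pandas as pd

freq = "1h"
df = pd.DataFrame(
    {"state": [2, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1]},
    index=pd.date_range("00:00", freq=freq, periods=14),
)

# flag rows whose state differs from the previous row, then number the runs
is_first = df["state"] != df["state"].shift()
run_id = is_first.cumsum().rename("run_id")

# one row per run: which state it was and how many samples it covered
runs = df.groupby(run_id)["state"].agg(value="first", n_rows="count")

# each sample stands for one `freq` interval; express the run length in hours
runs["hours"] = runs["n_rows"] * pd.Timedelta(freq) / pd.Timedelta("1h")

# average run length and number of runs per state
agg = runs.groupby("value")["hours"].agg(["mean", "count"])
print(agg)
```

Counting rows per run and multiplying by the sampling interval avoids the start/end timestamp bookkeeping, but it only holds for a regular index; with irregular timestamps the start and end times would have to be taken from the index as in the answer above.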