Pandas: time between events

How do I calculate the time (number of days) between "events" in a Pandas time series? For example, given the series below, I would like to know, for every day in the series, how many days have passed since the last True.

                event
    2010-01-01  False
    2010-01-02   True
    2010-01-03  False
    2010-01-04  False
    2010-01-05   True
    2010-01-06  False

The way I did this seems complicated, so I'm hoping for something more elegant. Obviously a for loop iterating row by row would work, but I'm looking for a scalable solution. My current attempt is below:

    date_range = pd.date_range('2010-01-01', '2010-01-06')
    df = pd.DataFrame([False, True, False, False, True, False],
                      index=date_range, columns=['event'])
    event_dates = df.index[df['event']]
    df2 = pd.DataFrame(event_dates, index=event_dates, columns=['max_event_date'])
    df = df.join(df2)
    df['max_event_date'] = df['max_event_date'].cummax(axis=0, skipna=False)
    df['days_since_event'] = df.index - df['max_event_date']

                event max_event_date days_since_event
    2010-01-01  False            NaT              NaT
    2010-01-02   True     2010-01-02           0 days
    2010-01-03  False     2010-01-02           1 days
    2010-01-04  False     2010-01-02           2 days
    2010-01-05   True     2010-01-05           0 days
    2010-01-06  False     2010-01-05           1 days
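For what it's worth, the same cummax idea can be written more compactly by forward-filling the event dates directly; this is a sketch of that variant, not part of the original question:

```python
import pandas as pd

date_range = pd.date_range('2010-01-01', '2010-01-06')
df = pd.DataFrame([False, True, False, False, True, False],
                  index=date_range, columns=['event'])

# Keep the date only on event rows, forward-fill it, then subtract.
last_event = pd.Series(df.index, index=df.index).where(df['event']).ffill()
df['days_since_event'] = df.index - last_event
```

Rows before the first event come out as NaT, matching the cummax version.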
python pandas
3 answers

Continuing to improve this answer, and hoping that someone comes in with "the" Pythonic way. Until then, I think this latest update works best.

    last = pd.to_datetime(np.nan)  # NaT before any event has been seen

    def elapsed(row):
        global last  # must precede any use of `last`; Python 3 rejects it later in the body
        if not row.event:
            return row.name - last
        last = row.name
        return row.name - last  # 0 days on event rows

    df['elapsed'] = df.apply(elapsed, axis=1)
    df

                event elapsed
    2010-01-01  False     NaT
    2010-01-02   True  0 days
    2010-01-03  False  1 days
    2010-01-04  False  2 days
    2010-01-05   True  0 days
    2010-01-06  False  1 days

---

I'm leaving the previous attempts below, although they are not optimal.

---

Instead of doing multiple passes, it seems easier to just loop over the index:

    df['elapsed'] = 0
    col = df.columns.get_loc('elapsed')
    for i in range(1, len(df)):
        if not df['event'].iloc[i]:
            # positional access: i - 1 arithmetic does not work on a DatetimeIndex label
            df.iloc[i, col] = df.iloc[i - 1, col] + 1
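The same row-counting logic can be vectorized with groupby: each True starts a new group, and `cumcount()` numbers the rows inside it. A sketch of this idea (my addition, using the question's example frame):

```python
import pandas as pd

date_range = pd.date_range('2010-01-01', '2010-01-06')
df = pd.DataFrame([False, True, False, False, True, False],
                  index=date_range, columns=['event'])

# Each True starts a new group; cumcount() numbers rows 0, 1, 2, ... within it.
groups = df['event'].cumsum()
# Group 0 holds the rows before the first event, where the count is undefined.
df['elapsed'] = df.groupby(groups).cumcount().where(groups > 0)
```

Note this counts rows rather than calendar days, so it matches the loop only when the index has daily frequency.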

---

Let's say True marks your event of interest.

    trues = df[df['event']].copy()           # .copy() avoids SettingWithCopyWarning
    trues['Dates'] = trues.index             # needed because .diff() doesn't work on the index
    trues['Elapsed'] = trues['Dates'].diff()

A single-pass solution would certainly be ideal, but here is a multi-pass solution using only (presumably) cythonized pandas functions:

    def get_delay(ds):
        x1 = (~ds).cumsum()
        x2 = x1.where(ds, np.nan).ffill()
        return x1 - x2

    date_range = pd.date_range('2010-01-01', '2010-01-06')
    ds = pd.Series([False, True, False, False, True, False], index=date_range)
    pd.concat([ds, get_delay(ds)], axis=1)

                Event  Last
    2010-01-01  False   NaN
    2010-01-02   True     0
    2010-01-03  False     1
    2010-01-04  False     2
    2010-01-05   True     0
    2010-01-06  False     1
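To make the trick concrete, here is the same function with its intermediates traced on the example series (comments show the values each step produces):

```python
import numpy as np
import pandas as pd

ds = pd.Series([False, True, False, False, True, False])

x1 = (~ds).cumsum()                # running count of False rows:        1 1 2 3 3 4
x2 = x1.where(ds, np.nan).ffill()  # that count frozen at each True:   NaN 1 1 1 3 3
delay = x1 - x2                    # rows since the last True:         NaN 0 1 2 0 1
```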

Interestingly, in some quick tests it also performs a little better, perhaps because it avoids row-by-row operations:

    %%timeit -n 1000
    def get_delay(ds):
        x1 = (~ds).cumsum()
        x2 = x1.where(ds, np.nan).ffill()
        return x1 - x2

    n = 100
    events = np.random.choice([True, False], size=n)
    date_range = pd.date_range('2010-01-01', periods=n)
    df = pd.DataFrame(events, index=date_range, columns=['event'])
    get_delay(df['event'])

    1000 loops, best of 3: 1.09 ms per loop

Compared with the row-by-row apply approach using a global variable:

    %%timeit -n 1000
    last = pd.to_datetime(np.nan)
    def elapsed(row):
        global last
        if not row.event:
            return row.name - last
        last = row.name
        return row.name - last

    n = 100
    events = np.random.choice([True, False], size=n)
    date_range = pd.date_range('2010-01-01', periods=n)
    df = pd.DataFrame(events, index=date_range, columns=['event'])
    df.apply(elapsed, axis=1)

    1000 loops, best of 3: 2.4 ms per loop

Perhaps there is some nuance in this comparison that makes it unfair, but in any case the version without custom row-wise functions is certainly no slower, and here it looks roughly twice as fast.


Recently I came across groupby().diff(), which suggests the following method:

  • Use groupby.diff to calculate the days since the last True day:

        df.loc[df.index[-1] + pd.Timedelta(days=1), 'event'] = True  # add an artificial True day for interpolation
        df['last'] = df.index
        df['last'] = df.groupby('event')['last'].diff()
        df.loc[df['event'] == False, 'last'] = None

    which gives you:

                    event   last
        2010-01-01  False    NaT
        2010-01-02   True    NaT
        2010-01-03  False    NaT
        2010-01-04  False    NaT
        2010-01-05   True 3 days
        2010-01-06  False    NaT
        2010-01-07   True 2 days
  • Use tshift() to shift the differences so that both True and False rows carry the correct last value:

        # note: tshift() was removed in pandas 2.0; shift(periods=-1, freq='D') is the modern equivalent
        df['last'] = (df['last'] - pd.Timedelta(days=1)).tshift(periods=-1, freq='D')
        df.loc[df['event'], ['last']] = pd.Timedelta(days=0)

    You'll get:

                    event   last
        2010-01-01  False    NaT
        2010-01-02   True 0 days
        2010-01-03  False    NaT
        2010-01-04  False 2 days
        2010-01-05   True 0 days
        2010-01-06  False 1 days
        2010-01-07   True 0 days
  • Finally, interpolate the NaN values linearly to get the final result:

        df['last'] /= np.timedelta64(1, 'D')
        df.interpolate(method='linear', axis=0, inplace=True)
        df.drop(df.index[-1], inplace=True)  # erase the artificial row
        df['last'] *= np.timedelta64(1, 'D')

                    event   last
        2010-01-01  False    NaN
        2010-01-02   True 0 days
        2010-01-03  False 1 days
        2010-01-04  False 2 days
        2010-01-05   True 0 days
        2010-01-06  False 1 days
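As a sanity check (my addition, not part of any answer), the interpolation result above should agree, in whole days, with the cumsum-based get_delay from the previous answer on the original six-day frame:

```python
import numpy as np
import pandas as pd

def get_delay(ds):
    x1 = (~ds).cumsum()
    x2 = x1.where(ds, np.nan).ffill()
    return x1 - x2

date_range = pd.date_range('2010-01-01', '2010-01-06')
df = pd.DataFrame([False, True, False, False, True, False],
                  index=date_range, columns=['event'])

counts = get_delay(df['event'])  # NaN 0 1 2 0 1, matching the 'last' column in days
```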
