I am analyzing a time series and, based on certain criteria, I can flag the rows that are either start or end events. At this point, my series looks something like this (I omitted some duplicate values for brevity):
Setup
    import numpy as np
    import pandas
    from pandas import Timestamp

    datadict = {'event': {
        Timestamp('2010-01-01 00:20:00', tz=None): 'event start',
        Timestamp('2010-01-01 00:30:00', tz=None): '--',
        Timestamp('2010-01-01 00:40:00', tz=None): '--',
        Timestamp('2010-01-01 00:50:00', tz=None): '--',
        Timestamp('2010-01-01 01:00:00', tz=None): '--',
        Timestamp('2010-01-01 01:10:00', tz=None): 'event end',
        Timestamp('2010-01-01 01:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:30:00', tz=None): 'event start',
        Timestamp('2010-01-01 02:40:00', tz=None): '--',
        Timestamp('2010-01-01 02:50:00', tz=None): '--',
        Timestamp('2010-01-01 03:00:00', tz=None): '--',
        Timestamp('2010-01-01 03:10:00', tz=None): '--',
        Timestamp('2010-01-01 03:20:00', tz=None): '--',
        Timestamp('2010-01-01 03:30:00', tz=None): 'event end',
    }}
    data = pandas.DataFrame.from_dict(datadict)

This gives:

                               event
    2010-01-01 00:20:00  event start
    2010-01-01 00:30:00           --
    2010-01-01 00:40:00           --
    2010-01-01 00:50:00           --
    2010-01-01 01:00:00           --
    2010-01-01 01:10:00    event end
    2010-01-01 01:20:00           --
    2010-01-01 02:20:00           --
    2010-01-01 02:30:00  event start
    2010-01-01 02:40:00           --
    2010-01-01 02:50:00           --
    2010-01-01 03:00:00           --
    2010-01-01 03:10:00           --
    2010-01-01 03:20:00           --
    2010-01-01 03:30:00    event end
Here is what I would like to achieve (ideally without for loops):

                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --            NA
    2010-01-01 02:20:00           --            NA
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2
    2010-01-01 03:40:00           --            NA
    2010-01-01 03:50:00           --            NA
Here is what I tried
With some optimistic assumptions about the quality of my data (every start is followed by exactly one end), I can get event numbers as follows:
    # keep only the rows that mark an event boundary
    table = data[data.event != '--'].reset_index()
    # rows 0-1 belong to event 1, rows 2-3 to event 2, and so on
    table['event number'] = 1 + np.floor(table.index / 2)
    table = table.set_index('index')

                               event  event number
    index
    2010-01-01 00:20:00  event start             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 02:30:00  event start             2
    2010-01-01 03:30:00    event end             2
I can then join this back to my original DataFrame and forward-fill the gaps using fillna with method='ffill':
    data2 = data.join(table[['event number']])
    # forward-fill the event numbers (equivalent to .ffill())
    data2['filled'] = data2['event number'].fillna(method='ffill')

                               event  event number  filled
    2010-01-01 00:20:00  event start             1       1
    2010-01-01 00:30:00           --           NaN       1
    2010-01-01 00:40:00           --           NaN       1
    2010-01-01 00:50:00           --           NaN       1
    2010-01-01 01:00:00           --           NaN       1
    2010-01-01 01:10:00    event end             1       1
    2010-01-01 01:20:00           --           NaN       1  # <- d'oh
    2010-01-01 02:20:00           --           NaN       1  # <- d'oh
    2010-01-01 02:30:00  event start             2       2
    2010-01-01 02:40:00           --           NaN       2
    2010-01-01 02:50:00           --           NaN       2
    2010-01-01 03:00:00           --           NaN       2
    2010-01-01 03:10:00           --           NaN       2
    2010-01-01 03:20:00           --           NaN       2
    2010-01-01 03:30:00    event end             2       2
Problem
As you can see, the forward fill also assigns the gap between events (01:20 - 02:20) to event #1, although those rows belong to no event.
Question
Can I leave these gaps between events unnumbered (NA) without a loop?
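One direction that might work is to compare running counts of starts and ends. This is only a rough sketch, not something I have validated on the real data. It makes the same optimistic assumption as above: starts and ends strictly alternate. The shortened frame and the names `starts`, `ends_before` and `inside` are made up for this illustration.

```python
import pandas as pd

# Shortened stand-in for `data` (hypothetical: same labels, fewer rows).
idx = pd.date_range('2010-01-01 00:20', periods=8, freq='10min')
labels = ['event start', '--', 'event end', '--', '--',
          'event start', '--', 'event end']
data = pd.DataFrame({'event': labels}, index=idx)

# Running count of starts, including the current row.
starts = (data['event'] == 'event start').cumsum()
# Running count of ends seen strictly before the current row, so that
# the 'event end' row itself still counts as inside its event.
ends_before = (data['event'] == 'event end').cumsum().shift(1, fill_value=0)

# A row is inside an event while more starts than prior ends were seen.
inside = starts > ends_before

# Keep the start count inside events and leave NA in the gaps.
data['event number'] = starts.where(inside).astype('Int64')
print(data)
```

The nullable `Int64` dtype is only there so the gaps show up as `<NA>` next to integer event numbers; without the cast the column would be float with `NaN`. If the data contains two starts in a row or an end without a start, this numbering will be off.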