Fill NA values in a pandas Series, stopping at an end marker

I am analysing a time series and, based on certain criteria, I can flag the rows that are either event starts or event ends. At this point, my series looks something like this (I have left out some duplicate values for brevity):

Setup

    import numpy as np
    import pandas
    from pandas import Timestamp

    datadict = {'event': {
        Timestamp('2010-01-01 00:20:00', tz=None): 'event start',
        Timestamp('2010-01-01 00:30:00', tz=None): '--',
        Timestamp('2010-01-01 00:40:00', tz=None): '--',
        Timestamp('2010-01-01 00:50:00', tz=None): '--',
        Timestamp('2010-01-01 01:00:00', tz=None): '--',
        Timestamp('2010-01-01 01:10:00', tz=None): 'event end',
        Timestamp('2010-01-01 01:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:30:00', tz=None): 'event start',
        Timestamp('2010-01-01 02:40:00', tz=None): '--',
        Timestamp('2010-01-01 02:50:00', tz=None): '--',
        Timestamp('2010-01-01 03:00:00', tz=None): '--',
        Timestamp('2010-01-01 03:10:00', tz=None): '--',
        Timestamp('2010-01-01 03:20:00', tz=None): '--',
        Timestamp('2010-01-01 03:30:00', tz=None): 'event end',
    }}
    data = pandas.DataFrame.from_dict(datadict)

                               event
    2010-01-01 00:20:00  event start
    2010-01-01 00:30:00           --
    2010-01-01 00:40:00           --
    2010-01-01 00:50:00           --
    2010-01-01 01:00:00           --
    2010-01-01 01:10:00    event end
    2010-01-01 01:20:00           --
    2010-01-01 02:20:00           --
    2010-01-01 02:30:00  event start
    2010-01-01 02:40:00           --
    2010-01-01 02:50:00           --
    2010-01-01 03:00:00           --
    2010-01-01 03:10:00           --
    2010-01-01 03:20:00           --
    2010-01-01 03:30:00    event end

Here is what I would like to achieve (ideally without for loops)

                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --            NA
    2010-01-01 02:20:00           --            NA
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2
    2010-01-01 03:40:00           --            NA
    2010-01-01 03:50:00           --            NA

Here is what I tried

With some optimistic assumptions about the quality of my data, I can get the event numbers as follows:

    # keep only the start/end marker rows
    table = data[data.event != '--'].reset_index()
    # markers come in pairs, so rows 0-1 belong to event 1, rows 2-3 to event 2, ...
    table['event number'] = 1 + np.floor(table.index / 2)
    table = table.set_index('index')

                               event  event number
    index
    2010-01-01 00:20:00  event start             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 02:30:00  event start             2
    2010-01-01 03:30:00    event end             2
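The pairing above only holds if the markers strictly alternate, beginning with a start. A minimal sketch of a check for that assumption is given here; `markers_alternate` is a hypothetical helper name, not part of pandas, and the marker strings are taken from the sample data.

```python
import pandas as pd


def markers_alternate(events: pd.Series) -> bool:
    """Return True if the non-'--' markers go start, end, start, end, ..."""
    # Drop the filler rows and renumber the remaining markers 0, 1, 2, ...
    markers = events[events != '--'].reset_index(drop=True)
    # Even positions must be starts, odd positions must be ends.
    starts_ok = (markers.iloc[0::2] == 'event start').all()
    ends_ok = (markers.iloc[1::2] == 'event end').all()
    return bool(starts_ok and ends_ok)


good = pd.Series(['event start', '--', 'event end', '--', 'event start', 'event end'])
bad = pd.Series(['event start', 'event start', 'event end'])
print(markers_alternate(good), markers_alternate(bad))  # True False
```

A trailing start without a matching end still passes this check; whether that should count as valid depends on the data.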

I can then join this back to my original DataFrame and fillna with method='ffill':

    data2 = data.join(table[['event number']])
    data2['filled'] = data2['event number'].fillna(method='ffill')

                               event  event number  filled
    2010-01-01 00:20:00  event start             1       1
    2010-01-01 00:30:00           --           NaN       1
    2010-01-01 00:40:00           --           NaN       1
    2010-01-01 00:50:00           --           NaN       1
    2010-01-01 01:00:00           --           NaN       1
    2010-01-01 01:10:00    event end             1       1
    2010-01-01 01:20:00           --           NaN       1  # <- d'oh
    2010-01-01 02:20:00           --           NaN       1  # <- d'oh
    2010-01-01 02:30:00  event start             2       2
    2010-01-01 02:40:00           --           NaN       2
    2010-01-01 02:50:00           --           NaN       2
    2010-01-01 03:00:00           --           NaN       2
    2010-01-01 03:10:00           --           NaN       2
    2010-01-01 03:20:00           --           NaN       2
    2010-01-01 03:30:00    event end             2       2

Problem

As you can see, the rows between the two events (01:20 - 02:20) are also filled with event number 1, although they belong to no event.

Question

Can I leave these gaps between events unfilled without using a loop?

1 answer

You can achieve this by comparing the cumulative count of 'event start' rows with the cumulative count of 'event end' rows. First, number every row by the count of starts seen so far:

    >>> data['event number'] = (data.event == 'event start').cumsum()
    >>> data
                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --             1
    2010-01-01 02:20:00           --             1
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2
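The numbering works because a boolean Series sums as 0s and 1s, so its cumulative sum goes up by one on each start row and stays flat everywhere else. A small illustration on a made-up series (the values here are not the question's data):

```python
import pandas as pd

events = pd.Series(['event start', '--', 'event end', '--', 'event start', '--'])

# True only on the rows that open an event
is_start = events == 'event start'

# True counts as 1 and False as 0, so this is a running count of starts
event_number = is_start.cumsum()

print(event_number.tolist())  # [1, 1, 1, 1, 2, 2]
```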

Now you just need to set NaN on the rows that belong to no event. These are exactly the rows where the cumulative count of 'event start' equals the cumulative count of 'event end', with the end count shifted down by one row so that the 'event end' row itself still belongs to its event:

    >>> idx = data['event number'] == (data.event.shift(1) == 'event end').cumsum()
    >>> data.loc[idx, 'event number'] = np.nan
    >>> data
                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --           NaN
    2010-01-01 02:20:00           --           NaN
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2

    [15 rows x 2 columns]
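The two steps can also be combined into one function. The sketch below is a variant rather than the exact code above: it uses `Series.where` to keep the number only while an event is open, instead of assigning NaN through `.loc`. The function name `number_events` is hypothetical, and the sketch assumes that starts and ends alternate properly.

```python
import pandas as pd


def number_events(events: pd.Series) -> pd.Series:
    """Number the rows inside each event; rows between events become NaN."""
    # Count of events that have started up to and including each row.
    started = (events == 'event start').cumsum()
    # Count of events that ended on an earlier row; the shift keeps the
    # 'event end' row itself inside its event.
    ended = (events.shift(1) == 'event end').cumsum()
    # An event is open wherever more events have started than ended.
    return started.where(started > ended)


events = pd.Series(['--', 'event start', '--', 'event end',
                    '--', 'event start', 'event end', '--'])
numbers = number_events(events)
print(numbers.tolist())  # [nan, 1.0, 1.0, 1.0, nan, 2.0, 2.0, nan]
```

With the question's frame this would be called as `data['event number'] = number_events(data['event'])`. The result is a float column because NaN cannot be stored in a plain integer column.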