Calculate time in a specific state for time series data

I have an irregularly indexed time series with second resolution, for example:

    import pandas as pd

    idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
           '2012-09-26 18:35:11', '2012-11-11 2:34:59']
    status = [1, 0, 1, 0]
    df = pd.DataFrame(status, index=idx, columns=['status'])
    df = df.reindex(pd.to_datetime(df.index))

    In [62]: df
    Out[62]:
                         status
    2012-01-01 12:43:35       1
    2012-03-12 15:46:43       0
    2012-09-26 18:35:11       1
    2012-11-11 02:34:59       0

and I'm interested in the fraction of the year during which the status is 1. Currently I reindex the frame to every second of the year, forward-filling the status, as follows:

    full_idx = pd.date_range(start='1/1/2012', end='12/31/2012', freq='s')
    df1 = df.reindex(full_idx, method='ffill')

which returns a DataFrame containing an entry for every second of the year; taking its mean then gives the fraction of time spent in status 1, for example:

    In [66]: df1
    Out[66]:
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 31536001 entries, 2012-01-01 00:00:00 to 2012-12-31 00:00:00
    Freq: S
    Data columns:
    status    31490186  non-null values
    dtypes: float64(1)

    In [67]: df1.status.mean()
    Out[67]: 0.31953371123308066

The problem is that I have to do this for a lot of data, and reindexing to every second of the year is by far the most expensive operation.

What are the best ways to do this?

2 answers

There seems to be no pandas method for computing the time differences between entries of an irregular time series, but there is a convenient method for converting the index of a time series into an array of datetime.datetime objects, which yield datetime.timedelta objects under subtraction.

    In [6]: start_end = pd.DataFrame({'status': [0, 0]},
       ...:                          index=[pd.datetools.parse('1/1/2012'),
       ...:                                 pd.datetools.parse('12/31/2012')])

    In [7]: df = df.append(start_end).sort()

    In [8]: df
    Out[8]:
                         status
    2012-01-01 00:00:00       0
    2012-01-01 12:43:35       1
    2012-03-12 15:46:43       0
    2012-09-26 18:35:11       1
    2012-11-11 02:34:59       0
    2012-12-31 00:00:00       0

    In [9]: pydatetime = pd.Series(df.index.to_pydatetime(), index=df.index)

    In [11]: df['duration'] = pydatetime.diff().shift(-1).\
       ....:     map(datetime.timedelta.total_seconds, na_action='ignore')

    In [16]: df
    Out[16]:
                         status  duration
    2012-01-01 00:00:00       0     45815
    2012-01-01 12:43:35       1   6145388
    2012-03-12 15:46:43       0  17117308
    2012-09-26 18:35:11       1   3916788
    2012-11-11 02:34:59       0   4310701
    2012-12-31 00:00:00       0       NaN

    In [17]: (df.status * df.duration).sum() / df.duration.sum()
    Out[17]: 0.31906950786402843

Note:

  • Our answers differ because I set the status before the first timestamp to zero, whereas those entries are NaN in your df1 (there is no earlier value to forward-fill), and NaN values are excluded by pandas' mean().
  • timedelta.total_seconds() is new in Python 2.7.
  • Timings comparing this method and reindexing:

    In [8]: timeit delta_method(df)
    1000 loops, best of 3: 1.3 ms per loop

    In [9]: timeit redindexing(df)
    1 loops, best of 3: 2.78 s per loop
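Several calls from the session above (pd.datetools.parse, DataFrame.append, DataFrame.sort) have since been removed from pandas. A sketch of the same delta method in current pandas, assuming the same status-0 boundary rows at the start and end of the year:

```python
import pandas as pd

idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
       '2012-09-26 18:35:11', '2012-11-11 2:34:59']
df = pd.DataFrame({'status': [1, 0, 1, 0]}, index=pd.to_datetime(idx))

# Pad with status-0 boundary rows so the first and last intervals are covered.
bounds = pd.DataFrame({'status': [0, 0]},
                      index=[pd.Timestamp('2012-01-01'),
                             pd.Timestamp('2012-12-31')])
df = pd.concat([df, bounds]).sort_index()

# Seconds each row's status persists until the next change point
# (diff gives the gap to the previous row; shift(-1) realigns it forward).
df['duration'] = df.index.to_series().diff().shift(-1).dt.total_seconds()

# Duration-weighted mean of the status; the trailing NaN is skipped by sum().
fraction = (df['status'] * df['duration']).sum() / df['duration'].sum()
```

This reproduces the 0.3190695... figure from the session without any deprecated API.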

Another potential approach is to use the traces library.

    import traces
    from dateutil.parser import parse as date_parse

    idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
           '2012-09-26 18:35:11', '2012-11-11 2:34:59']
    status = [1, 0, 1, 0]

    # create a TimeSeries from the date strings and statuses
    ts = traces.TimeSeries(default=0)
    for date_string, status_value in zip(idx, status):
        ts[date_parse(date_string)] = status_value

    # compute the distribution
    ts.distribution(
        start=date_parse('2012-01-01'),
        end=date_parse('2013-01-01'),
    )
    # {0: 0.6818022667476219, 1: 0.31819773325237805}

The distribution is computed between the start of January 1, 2012 and the end of December 31, 2012 (equivalently, the start of January 1, 2013) without any resampling, assuming the status is 0 at the beginning of the year (the default=0 parameter).

Timing results:

    In [2]: timeit ts.distribution(
       ...:     start=date_parse('2012-01-01'),
       ...:     end=date_parse('2013-01-01'),
       ...: )
    619 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
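The same duration-weighted distribution can be reproduced with only the standard library, which is a useful cross-check when traces is not installed. A sketch; the initial (start, 0) change point mirrors what default=0 implies:

```python
from datetime import datetime

idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
       '2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]

start = datetime(2012, 1, 1)
end = datetime(2013, 1, 1)

# Walk the change points in order, accumulating how long each status holds.
points = [(start, 0)] + [(datetime.strptime(s, '%Y-%m-%d %H:%M:%S'), v)
                         for s, v in zip(idx, status)]
totals = {0: 0.0, 1: 0.0}
for (t0, v), (t1, _) in zip(points, points[1:] + [(end, None)]):
    totals[v] += (t1 - t0).total_seconds()

span = (end - start).total_seconds()
dist = {k: v / span for k, v in totals.items()}
```

Note that the denominator here is the full leap year (31,622,400 s), which is why this figure differs slightly from the 365-day calculations in the question and the first answer.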
