What is the most efficient way to determine overlap periods in Python?

I am trying to determine what percentage of the time two time series overlap, using the Python pandas library. The data is not synchronous, so the timestamps of the data points do not line up. Here is an example:

Time Series 1

2016-10-05 11:50:02.000734  0.50
2016-10-05 11:50:03.000033  0.25
2016-10-05 11:50:10.000479  0.50
2016-10-05 11:50:15.000234  0.25
2016-10-05 11:50:37.000199  0.50
2016-10-05 11:50:49.000401  0.50
2016-10-05 11:50:51.000362  0.25
2016-10-05 11:50:53.000424  0.75
2016-10-05 11:50:53.000982  0.25
2016-10-05 11:50:58.000606  0.75

Time Series 2

2016-10-05 11:50:07.000537  0.50
2016-10-05 11:50:11.000994  0.50
2016-10-05 11:50:19.000181  0.50
2016-10-05 11:50:35.000578  0.50
2016-10-05 11:50:46.000761  0.50
2016-10-05 11:50:49.000295  0.75
2016-10-05 11:50:51.000835  0.75
2016-10-05 11:50:55.000792  0.25
2016-10-05 11:50:55.000904  0.75
2016-10-05 11:50:57.000444  0.75

Assuming that the series retains its value until the next change, what is the most effective way to determine the percentage of time when they have the same value?

Example

The overlap can only be calculated from 11:50:07.000537 to 11:50:57.000444, since that is the period for which we have data for both series. The intervals during which the two series hold the same value:

  • 11:50:10.000479 - 11:50:15.000234 (both have a value of 0.50): 4.999755 seconds
  • 11:50:37.000199 - 11:50:49.000295 (both have a value of 0.50): 12.000096 seconds
  • 11:50:53.000424 - 11:50:53.000982 (both have a value of 0.75): 0.000558 seconds
  • 11:50:55.000792 - 11:50:55.000904 (both have a value of 0.25): 0.000112 seconds

Result: (4.999755 + 12.000096 + 0.000558 + 0.000112) / 49.999907 = 17.000521 / 49.999907 ≈ 34%
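A quick sanity check of that arithmetic (plain Python, no pandas needed):

```python
# Durations (seconds) of the four matching intervals listed above,
# and the total span covered by both series.
durations = [4.999755, 12.000096, 0.000558, 0.000112]
total = 49.999907

shared = sum(durations)
print(round(shared, 6))          # 17.000521
print(round(shared / total, 4))  # 0.34, i.e. 34%
```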

One complication is that my real series have much more data, around 1000-10000 observations each, and I need to run many more pairs. I thought about forward-filling each series and then simply comparing the rows, dividing the number of matching rows by the total number of rows, but I suspect that would not be very efficient.
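For reference, that row-counting idea can be sketched as follows (the toy series here are made up for illustration). Note that it weights every aligned row equally rather than by elapsed time, so it only approximates the duration-based answer:

```python
import pandas as pd

# Hypothetical toy series; real data would have thousands of points.
s1 = pd.Series([0.50, 0.25, 0.50],
               index=pd.to_datetime(['2016-10-05 11:50:02',
                                     '2016-10-05 11:50:03',
                                     '2016-10-05 11:50:10']))
s2 = pd.Series([0.50, 0.25],
               index=pd.to_datetime(['2016-10-05 11:50:04',
                                     '2016-10-05 11:50:06']))

# Align on the union of timestamps, forward-fill, drop rows from before
# both series have data, then count matching rows.
df = pd.concat([s1.rename('s1'), s2.rename('s2')], axis=1).ffill().dropna()
match_fraction = df['s1'].eq(df['s2']).mean()
print(match_fraction)  # one of the three aligned rows matches
```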

2 answers

Setup

create 2 time series

from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd

txt1 = """2016-10-05 11:50:02.000734  0.50
2016-10-05 11:50:03.000033  0.25
2016-10-05 11:50:10.000479  0.50
2016-10-05 11:50:15.000234  0.25
2016-10-05 11:50:37.000199  0.50
2016-10-05 11:50:49.000401  0.50
2016-10-05 11:50:51.000362  0.25
2016-10-05 11:50:53.000424  0.75
2016-10-05 11:50:53.000982  0.25
2016-10-05 11:50:58.000606  0.75"""

s1 = pd.read_csv(StringIO(txt1), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None,
                 squeeze=True).rename('s1').rename_axis(None)
# note: squeeze=True was removed in pandas 2.0; use .squeeze('columns') instead

txt2 = """2016-10-05 11:50:07.000537  0.50
2016-10-05 11:50:11.000994  0.50
2016-10-05 11:50:19.000181  0.50
2016-10-05 11:50:35.000578  0.50
2016-10-05 11:50:46.000761  0.50
2016-10-05 11:50:49.000295  0.75
2016-10-05 11:50:51.000835  0.75
2016-10-05 11:50:55.000792  0.25
2016-10-05 11:50:55.000904  0.75
2016-10-05 11:50:57.000444  0.75"""

s2 = pd.read_csv(StringIO(txt2), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None,
                 squeeze=True).rename('s2').rename_axis(None)

TL;DR

df = pd.concat([s1, s2], axis=1).ffill().dropna()
overlap = df.index.to_series().diff().shift(-1) \
            .fillna(pd.Timedelta(0)).groupby(df.s1.eq(df.s2)).sum()
overlap.div(overlap.sum())

False    0.666657
True     0.333343
dtype: float64

explanation

build the base pd.DataFrame df

  • use pd.concat to align the indexes
  • use ffill to push the values forward
  • use dropna to get rid of the rows from before both series have data

df = pd.concat([s1, s2], axis=1).ffill().dropna()
df


calculate 'duration'
from each timestamp to the next

df['duration'] = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
df


calculate overlap

  • df.s1.eq(df.s2) gives a boolean Series that is True where s1 and s2 have the same value
  • use groupby over that boolean Series to sum the total duration for True and False separately

overlap = df.groupby(df.s1.eq(df.s2)).duration.sum()
overlap

False   00:00:33.999548
True    00:00:17.000521
Name: duration, dtype: timedelta64[ns]

percentage of time with the same value

overlap.div(overlap.sum())

False    0.666657
True     0.333343
Name: duration, dtype: float64
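Since the question mentions running many pairs, the approach above can be wrapped in a small reusable helper (a sketch; the function name and the tiny example series are mine):

```python
import pandas as pd

def overlap_fraction(s1, s2):
    """Fraction of the jointly covered time during which two
    step-function series hold the same value."""
    df = pd.concat([s1.rename('s1'), s2.rename('s2')], axis=1).ffill().dropna()
    # Duration from each timestamp to the next; the last row contributes 0.
    duration = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
    overlap = duration.groupby(df['s1'].eq(df['s2'])).sum()
    return overlap.get(True, pd.Timedelta(0)) / overlap.sum()

# Tiny illustrative example: the values agree for 5 of the 10 shared seconds.
a = pd.Series([1.0, 2.0], index=pd.to_datetime(['2016-01-01 00:00:00',
                                                '2016-01-01 00:00:10']))
b = pd.Series([1.0, 2.0], index=pd.to_datetime(['2016-01-01 00:00:00',
                                                '2016-01-01 00:00:05']))
print(overlap_fraction(a, b))  # 0.5
```

Calling this once per pair keeps each comparison vectorized inside pandas, which matters when there are thousands of observations per series.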

Cool problem. I brute-forced this without using pandas or numpy vectorization, and I got your answer (thanks for working it out). I have not tested it on anything else. I am also not sure how fast it is; it only passes through each series once, but it does not do any vectorization.

import pandas as pd

# Prepare the dataframes
times_1 = ["2016-10-05 11:50:02.000734", "2016-10-05 11:50:03.000033",
           "2016-10-05 11:50:10.000479", "2016-10-05 11:50:15.000234",
           "2016-10-05 11:50:37.000199", "2016-10-05 11:50:49.000401",
           "2016-10-05 11:50:51.000362", "2016-10-05 11:50:53.000424",
           "2016-10-05 11:50:53.000982", "2016-10-05 11:50:58.000606"]
times_1 = [pd.Timestamp(t) for t in times_1]
vals_1 = [0.50, 0.25, 0.50, 0.25, 0.50, 0.50, 0.25, 0.75, 0.25, 0.75]

times_2 = ["2016-10-05 11:50:07.000537", "2016-10-05 11:50:11.000994",
           "2016-10-05 11:50:19.000181", "2016-10-05 11:50:35.000578",
           "2016-10-05 11:50:46.000761", "2016-10-05 11:50:49.000295",
           "2016-10-05 11:50:51.000835", "2016-10-05 11:50:55.000792",
           "2016-10-05 11:50:55.000904", "2016-10-05 11:50:57.000444"]
times_2 = [pd.Timestamp(t) for t in times_2]
vals_2 = [0.50, 0.50, 0.50, 0.50, 0.50, 0.75, 0.75, 0.25, 0.75, 0.75]

data_1 = pd.DataFrame({"time": times_1, "vals": vals_1})
data_2 = pd.DataFrame({"time": times_2, "vals": vals_2})

shared_time = 0  # running tally of shared time in seconds
t1_ind = 0       # pointer to current row in data_1
t2_ind = 0       # pointer to current row in data_2

# Loop through both dataframes once, incrementing either the t1 or t2 index.
# Stop one before the end of both since the loop indexes at +1.
while t1_ind < len(data_1.time) - 1 and t2_ind < len(data_2.time) - 1:
    val1, val2 = data_1.vals[t1_ind], data_2.vals[t2_ind]
    # Start and stop of the current time window in each series
    t1_start, t1_stop = data_1.time[t1_ind], data_1.time[t1_ind + 1]
    t2_start, t2_stop = data_2.time[t2_ind], data_2.time[t2_ind + 1]
    # Values match and window 2 starts inside window 1
    if val1 == val2 and t1_start <= t2_start <= t1_stop:
        shared_time += (min(t1_stop, t2_stop) - t2_start).total_seconds()
        t1_ind += 1
    # Values match and window 1 starts inside window 2
    elif val1 == val2 and t2_start <= t1_start <= t2_stop:
        shared_time += (min(t1_stop, t2_stop) - t1_start).total_seconds()
        t2_ind += 1
    # No overlap counted; advance whichever window starts earlier
    elif t1_start < t2_start:
        t1_ind += 1
    else:
        t2_ind += 1

# Maximum possible shared time (not pretty)
shared_start = max(data_1.time[0], data_2.time[0])
shared_stop = min(data_1.time.iloc[-1], data_2.time.iloc[-1])
max_possible_shared = (shared_stop - shared_start).total_seconds()

print("Shared time:", shared_time)
print("Total possible shared:", max_possible_shared)
print("Percent shared:", shared_time * 100 / max_possible_shared, "%")

Output:

Shared time: 17.000521
Total possible shared: 49.999907
Percent shared: 34.0011052421 %
