What is the most efficient way to determine overlap periods in Python?

I am trying to determine what percentage of the time two time series overlap, using the Python pandas library. The data is not synchronous, so the timestamps of the data points do not line up. Here is an example:

Time Series 1

2016-10-05 11:50:02.000734  0.50
2016-10-05 11:50:03.000033  0.25
2016-10-05 11:50:10.000479  0.50
2016-10-05 11:50:15.000234  0.25
2016-10-05 11:50:37.000199  0.50
2016-10-05 11:50:49.000401  0.50
2016-10-05 11:50:51.000362  0.25
2016-10-05 11:50:53.000424  0.75
2016-10-05 11:50:53.000982  0.25
2016-10-05 11:50:58.000606  0.75

Time Series 2

2016-10-05 11:50:07.000537  0.50
2016-10-05 11:50:11.000994  0.50
2016-10-05 11:50:19.000181  0.50
2016-10-05 11:50:35.000578  0.50
2016-10-05 11:50:46.000761  0.50
2016-10-05 11:50:49.000295  0.75
2016-10-05 11:50:51.000835  0.75
2016-10-05 11:50:55.000792  0.25
2016-10-05 11:50:55.000904  0.75
2016-10-05 11:50:57.000444  0.75

Assuming that the series retains its value until the next change, what is the most effective way to determine the percentage of time when they have the same value?

Example

The overlap can only be calculated from 11:50:07.000537 to 11:50:57.000444, since that is the period for which we have data for both series. The intervals during which the two series hold the same value:

  • 11:50:10.000479 - 11:50:15.000234 (both have a value of 0.50): 4.999755 seconds
  • 11:50:37.000199 - 11:50:49.000295 (both have a value of 0.50): 12.000096 seconds
  • 11:50:53.000424 - 11:50:53.000982 (both have a value of 0.75): 0.000558 seconds
  • 11:50:55.000792 - 11:50:55.000904 (both have a value of 0.25): 0.000112 seconds

Result: (4.999755 + 12.000096 + 0.000558 + 0.000112) / 49.999907 = 17.000521 / 49.999907 ≈ 34%
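A quick sanity check of that arithmetic (plain Python, no pandas needed):

```python
# Durations (seconds) of the four matching intervals listed above,
# and the total span covered by both series.
durations = [4.999755, 12.000096, 0.000558, 0.000112]
total = 49.999907

shared = sum(durations)
print(round(shared, 6))          # 17.000521
print(round(shared / total, 4))  # 0.34, i.e. 34%
```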

One complication is that my real series have much more data, around 1000-10000 observations each, and I need to run many more pairs. I thought about forward-filling each series and then simply comparing the rows, dividing the number of matching rows by the total number of rows, but I suspect that would not be very efficient.
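For reference, that row-counting idea can be sketched as follows (the toy series here are made up for illustration). Note that it weights every aligned row equally rather than by elapsed time, so it only approximates the duration-based answer:

```python
import pandas as pd

# Hypothetical toy series; real data would have thousands of points.
s1 = pd.Series([0.50, 0.25, 0.50],
               index=pd.to_datetime(['2016-10-05 11:50:02',
                                     '2016-10-05 11:50:03',
                                     '2016-10-05 11:50:10']))
s2 = pd.Series([0.50, 0.25],
               index=pd.to_datetime(['2016-10-05 11:50:04',
                                     '2016-10-05 11:50:06']))

# Align on the union of timestamps, forward-fill, drop rows from before
# both series have data, then count matching rows.
df = pd.concat([s1.rename('s1'), s2.rename('s2')], axis=1).ffill().dropna()
match_fraction = df['s1'].eq(df['s2']).mean()
print(match_fraction)  # one of the three aligned rows matches
```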

2 answers

Setup

create 2 time series

from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd

txt1 = """2016-10-05 11:50:02.000734  0.50
2016-10-05 11:50:03.000033  0.25
2016-10-05 11:50:10.000479  0.50
2016-10-05 11:50:15.000234  0.25
2016-10-05 11:50:37.000199  0.50
2016-10-05 11:50:49.000401  0.50
2016-10-05 11:50:51.000362  0.25
2016-10-05 11:50:53.000424  0.75
2016-10-05 11:50:53.000982  0.25
2016-10-05 11:50:58.000606  0.75"""

s1 = pd.read_csv(StringIO(txt1), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None,
                 squeeze=True).rename('s1').rename_axis(None)
# note: squeeze=True was removed in pandas 2.0; use .squeeze('columns') instead

txt2 = """2016-10-05 11:50:07.000537  0.50
2016-10-05 11:50:11.000994  0.50
2016-10-05 11:50:19.000181  0.50
2016-10-05 11:50:35.000578  0.50
2016-10-05 11:50:46.000761  0.50
2016-10-05 11:50:49.000295  0.75
2016-10-05 11:50:51.000835  0.75
2016-10-05 11:50:55.000792  0.25
2016-10-05 11:50:55.000904  0.75
2016-10-05 11:50:57.000444  0.75"""

s2 = pd.read_csv(StringIO(txt2), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None,
                 squeeze=True).rename('s2').rename_axis(None)

TL;DR

df = pd.concat([s1, s2], axis=1).ffill().dropna()
overlap = df.index.to_series().diff().shift(-1) \
            .fillna(pd.Timedelta(0)).groupby(df.s1.eq(df.s2)).sum()
overlap.div(overlap.sum())

False    0.666657
True     0.333343
dtype: float64

explanation

build the base pd.DataFrame df

  • use pd.concat to align the indexes
  • use ffill to push the values forward
  • use dropna to get rid of the rows from before both series have data

df = pd.concat([s1, s2], axis=1).ffill().dropna()
df


calculate 'duration'
from each timestamp to the next

df['duration'] = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
df


calculate overlap

  • df.s1.eq(df.s2) gives a boolean Series that is True where s1 and s2 have the same value
  • use groupby over that boolean Series to sum the total duration for True and False separately

overlap = df.groupby(df.s1.eq(df.s2)).duration.sum()
overlap

False   00:00:33.999548
True    00:00:17.000521
Name: duration, dtype: timedelta64[ns]

percentage of time with the same value

overlap.div(overlap.sum())

False    0.666657
True     0.333343
Name: duration, dtype: float64
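Since the question mentions running many pairs, the approach above can be wrapped in a small reusable helper (a sketch; the function name and the tiny example series are mine):

```python
import pandas as pd

def overlap_fraction(s1, s2):
    """Fraction of the jointly covered time during which two
    step-function series hold the same value."""
    df = pd.concat([s1.rename('s1'), s2.rename('s2')], axis=1).ffill().dropna()
    # Duration from each timestamp to the next; the last row contributes 0.
    duration = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
    overlap = duration.groupby(df['s1'].eq(df['s2'])).sum()
    return overlap.get(True, pd.Timedelta(0)) / overlap.sum()

# Tiny illustrative example: the values agree for 5 of the 10 shared seconds.
a = pd.Series([1.0, 2.0], index=pd.to_datetime(['2016-01-01 00:00:00',
                                                '2016-01-01 00:00:10']))
b = pd.Series([1.0, 2.0], index=pd.to_datetime(['2016-01-01 00:00:00',
                                                '2016-01-01 00:00:05']))
print(overlap_fraction(a, b))  # 0.5
```

Calling this once per pair keeps each comparison vectorized inside pandas, which matters when there are thousands of observations per series.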

Cool problem. I brute-forced this without using pandas or numpy vectorization, and I got your answer (thanks for working it out). I have not tested it on anything else. I am also not sure how fast it is; it only passes through each series once, but it does not do any vectorization.

import pandas as pd

# Prepare the dataframes
times_1 = ["2016-10-05 11:50:02.000734", "2016-10-05 11:50:03.000033",
           "2016-10-05 11:50:10.000479", "2016-10-05 11:50:15.000234",
           "2016-10-05 11:50:37.000199", "2016-10-05 11:50:49.000401",
           "2016-10-05 11:50:51.000362", "2016-10-05 11:50:53.000424",
           "2016-10-05 11:50:53.000982", "2016-10-05 11:50:58.000606"]
times_1 = [pd.Timestamp(t) for t in times_1]
vals_1 = [0.50, 0.25, 0.50, 0.25, 0.50, 0.50, 0.25, 0.75, 0.25, 0.75]

times_2 = ["2016-10-05 11:50:07.000537", "2016-10-05 11:50:11.000994",
           "2016-10-05 11:50:19.000181", "2016-10-05 11:50:35.000578",
           "2016-10-05 11:50:46.000761", "2016-10-05 11:50:49.000295",
           "2016-10-05 11:50:51.000835", "2016-10-05 11:50:55.000792",
           "2016-10-05 11:50:55.000904", "2016-10-05 11:50:57.000444"]
times_2 = [pd.Timestamp(t) for t in times_2]
vals_2 = [0.50, 0.50, 0.50, 0.50, 0.50, 0.75, 0.75, 0.25, 0.75, 0.75]

data_1 = pd.DataFrame({"time": times_1, "vals": vals_1})
data_2 = pd.DataFrame({"time": times_2, "vals": vals_2})

shared_time = 0  # running tally of shared time in seconds
t1_ind = 0       # pointer to current row in data_1
t2_ind = 0       # pointer to current row in data_2

# Loop through both dataframes once, incrementing either the t1 or t2 index.
# Stop one before the end of both since the loop indexes at +1.
while t1_ind < len(data_1.time) - 1 and t2_ind < len(data_2.time) - 1:
    val1, val2 = data_1.vals[t1_ind], data_2.vals[t2_ind]
    # Start and stop of the current time window in each series
    t1_start, t1_stop = data_1.time[t1_ind], data_1.time[t1_ind + 1]
    t2_start, t2_stop = data_2.time[t2_ind], data_2.time[t2_ind + 1]
    # Values match and window 2 starts inside window 1
    if val1 == val2 and t1_start <= t2_start <= t1_stop:
        shared_time += (min(t1_stop, t2_stop) - t2_start).total_seconds()
        t1_ind += 1
    # Values match and window 1 starts inside window 2
    elif val1 == val2 and t2_start <= t1_start <= t2_stop:
        shared_time += (min(t1_stop, t2_stop) - t1_start).total_seconds()
        t2_ind += 1
    # No overlap counted; advance whichever window starts earlier
    elif t1_start < t2_start:
        t1_ind += 1
    else:
        t2_ind += 1

# Maximum possible shared time (not pretty)
shared_start = max(data_1.time[0], data_2.time[0])
shared_stop = min(data_1.time.iloc[-1], data_2.time.iloc[-1])
max_possible_shared = (shared_stop - shared_start).total_seconds()

print("Shared time:", shared_time)
print("Total possible shared:", max_possible_shared)
print("Percent shared:", shared_time * 100 / max_possible_shared, "%")

Output:

Shared time: 17.000521
Total possible shared: 49.999907
Percent shared: 34.0011052421 %
