Resampling a Timeseries

I have a dataset of the following form (dropbox download, 23 kB CSV).

The sampling rate varies from second to second, from 0 Hz to more than 200 Hz; over most of the dataset the maximum is about 50 samples per second.

When several samples fall within the same second, I want to assume they are spread evenly over that second. For example, this data:

    time                 x
    2012-12-06 21:12:40  128.75909883327378
    2012-12-06 21:12:40  32.799224301545976
    2012-12-06 21:12:40  98.932953779777989
    2012-12-06 21:12:43  132.07033814856786
    2012-12-06 21:12:43  132.07033814856786
    2012-12-06 21:12:43  65.71691352191452
    2012-12-06 21:12:44  117.1350194748169
    2012-12-06 21:12:45  13.095622561808861
    2012-12-06 21:12:47  61.295242676059246
    2012-12-06 21:12:48  94.774064119961352
    2012-12-06 21:12:49  80.169378222553533
    2012-12-06 21:12:49  80.291142695702533
    2012-12-06 21:12:49  136.55650749231367
    2012-12-06 21:12:49  127.29790925838365

should become:

    time                       x
    2012-12-06 21:12:40 000ms  128.75909883327378
    2012-12-06 21:12:40 333ms  32.799224301545976
    2012-12-06 21:12:40 666ms  98.932953779777989
    2012-12-06 21:12:43 000ms  132.07033814856786
    2012-12-06 21:12:43 333ms  132.07033814856786
    2012-12-06 21:12:43 666ms  65.71691352191452
    2012-12-06 21:12:44 000ms  117.1350194748169
    2012-12-06 21:12:45 000ms  13.095622561808861
    2012-12-06 21:12:47 000ms  61.295242676059246
    2012-12-06 21:12:48 000ms  94.774064119961352
    2012-12-06 21:12:49 000ms  80.169378222553533
    2012-12-06 21:12:49 250ms  80.291142695702533
    2012-12-06 21:12:49 500ms  136.55650749231367
    2012-12-06 21:12:49 750ms  127.29790925838365

Is there an easy way to do this with the pandas timeseries resampling functionality, or is there something built into numpy or scipy that will work?

2 answers

I don’t think there is a built-in pandas or numpy method / function for this.

However, I would just use a Python generator:

def repeats(lst):
    i_0 = None
    n = -1  # will still work if lst starts with None
    for i in lst:
        if i == i_0:
            n += 1
        else:
            n = 0
        yield n
        i_0 = i

# list(repeats([1, 1, 1, 2, 2, 3])) == [0, 1, 2, 0, 1, 0]

Then you can materialize this generator into a numpy array:

import numpy as np

df['rep'] = np.array(list(repeats(df['time'])))

Count the repetitions:

from collections import Counter

count = Counter(df['time'])
df['count'] = df['time'].apply(lambda x: count[x])
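As an aside, recent pandas can build both helper columns without a generator or a Counter, via groupby. A minimal sketch on illustrative data (the column names rep and count mirror the ones above; this assumes the frame is sorted by time, as in the question's data):

```python
import pandas as pd

# Illustrative frame with duplicated, sorted timestamps
df = pd.DataFrame({'time': pd.to_datetime([
    '2012-12-06 21:12:40', '2012-12-06 21:12:40',
    '2012-12-06 21:12:43', '2012-12-06 21:12:43',
    '2012-12-06 21:12:44'])})

# cumcount numbers each row within its group of equal timestamps
df['rep'] = df.groupby('time').cumcount()
# transform('size') broadcasts each group's size back onto its rows
df['count'] = df.groupby('time')['time'].transform('size')
```

Note that groupby collects all equal timestamps, not just consecutive ones, so this matches the generator only when the data is time-sorted.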

Finally, do the calculation (this is the most expensive step):

import datetime

df['time2'] = df.apply(
    lambda row: row['time'] + datetime.timedelta(0, 1)  # 1 s
                * row['rep'] / row['count'],
    axis=1)
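If the row-wise apply turns out to be the bottleneck, the same arithmetic can be vectorized in recent pandas with pd.to_timedelta. A sketch, assuming rep and count columns built as above:

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-06 21:12:49'] * 4),
    'rep': [0, 1, 2, 3],
    'count': [4, 4, 4, 4],
})

# Shift each row by rep/count of one second, all in one vectorized step
df['time2'] = df['time'] + pd.to_timedelta(df['rep'] / df['count'], unit='s')
```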

Note: to remove the helper columns afterwards, use del df['rep'] and del df['count'].


One more "built-in" way to do this would use shift twice, but I think it would be a little messier...
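The shift approach isn't spelled out, but one possible reading of it (a hedged sketch on toy data, not necessarily what the author had in mind): compare the time column with itself shifted by one row to mark run boundaries, then cumsum the marks into run labels.

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 2, 2, 3]})

# True wherever a new run of equal values starts; cumsum turns that into run ids
run_id = (df['time'] != df['time'].shift()).cumsum()
# Position of each row within its run -- same result as the repeats() generator
df['rep'] = df.groupby(run_id).cumcount()
```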


I found this an excellent use case for the pandas groupby mechanism, so I wanted to offer a solution based on it as well. I find it a little easier to follow than Andy's solution, though in the end it is not much shorter.

# First, get your data into a dataframe after having copied
# it with the mouse into a multi-line string:
import pandas as pd
from StringIO import StringIO

s = """2012-12-06 21:12:40 128.75909883327378
2012-12-06 21:12:40 32.799224301545976
2012-12-06 21:12:40 98.932953779777989
2012-12-06 21:12:43 132.07033814856786
2012-12-06 21:12:43 132.07033814856786
2012-12-06 21:12:43 65.71691352191452
2012-12-06 21:12:44 117.1350194748169
2012-12-06 21:12:45 13.095622561808861
2012-12-06 21:12:47 61.295242676059246
2012-12-06 21:12:48 94.774064119961352
2012-12-06 21:12:49 80.169378222553533
2012-12-06 21:12:49 80.291142695702533
2012-12-06 21:12:49 136.55650749231367
2012-12-06 21:12:49 127.29790925838365"""

sio = StringIO(s)
df = pd.io.parsers.read_csv(sio, parse_dates=[[0, 1]], sep=r'\s*', header=None)
df = df.set_index('0_1')
df.index.name = 'time'
df.columns = ['x']

So far this has only been data preparation, so if you want to compare the lengths of the solutions, start counting from here! ;)

# Now, group by identical time indices:
grouped = df.groupby(df.index)

# Create yourself a one-second timedelta
from datetime import timedelta
second = timedelta(seconds=1)

# Loop over the groups, collecting the new index parts in a list
l = []
for _, group in grouped:
    size = len(group)
    if size == 1:
        # go to pydatetime here too, so that the list is all in one format
        l.append(group.index.to_pydatetime())
    else:
        offsets = [i * second / size for i in range(size)]
        l.append(group.index.to_pydatetime() + offsets)

# Exchange the old index for the new one
import numpy as np
df.index = pd.DatetimeIndex(np.concatenate(l))
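On Python 3 with current pandas (where StringIO lives in io, and combining date columns at parse time is discouraged), the same groupby idea collapses into a few vectorized lines. A sketch of the logic, not the author's code, shown on the last four rows of the sample data:

```python
import io
import pandas as pd

s = """2012-12-06 21:12:49 80.169378222553533
2012-12-06 21:12:49 80.291142695702533
2012-12-06 21:12:49 136.55650749231367
2012-12-06 21:12:49 127.29790925838365"""

df = pd.read_csv(io.StringIO(s), sep=r'\s+', header=None,
                 names=['date', 'clock', 'x'])
df['time'] = pd.to_datetime(df['date'] + ' ' + df['clock'])

# Within each group of n equal timestamps, offset row i by i/n seconds
rep = df.groupby('time').cumcount()
size = df.groupby('time')['time'].transform('size')
df.index = pd.DatetimeIndex(df['time'] + pd.to_timedelta(rep / size, unit='s'))
```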
