Constructing a fixed-interval dataset from a random-interval dataset using stale data

Update: I have added a brief analysis of the three answers at the bottom of the question text and explained my choice.

My question is: What is the most efficient method for constructing a fixed-interval dataset from a random-interval dataset using stale data?

Some background: The above is a common problem in statistics. Often one has a sequence of observations occurring at random times. Call it Input. But one wants a sequence of observations occurring, say, every 5 minutes. Call it Output. One of the most common methods for constructing this dataset is to use stale data, i.e. to set each observation in Output equal to the most recently occurring observation in Input.
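To make the rule concrete, here is a minimal sketch on made-up toy data. The names tIn, vIn, tOut and vOut and all of the values are purely illustrative and are not part of the real datasets; the loop is written for clarity, not speed.

```matlab
% Toy data: three observations at irregular times (hypothetical values)
tIn = [0.0; 1.3; 4.1];
vIn = [10; 20; 30];

% Fixed grid on which observations are wanted
tOut = (1:5)';

% For each grid time, take the latest input observation at or before it
vOut = NaN(size(tOut));
for s = 1:numel(tOut)
    k = find(tIn <= tOut(s), 1, 'last');
    vOut(s) = vIn(k);
end
% vOut is now [10; 20; 20; 20; 30]
```

The value 20 is repeated three times because no new input observation arrives between times 1.3 and 4.1.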

Here is some code for creating example datasets:

    TInput = 100;
    TOutput = 50;
    InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
    Input = [InputTimeStamp, randn(TInput, 1)];
    OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
    Output = [OutputTimeStamp, NaN(TOutput, 1)];

Both datasets start close to midnight at the turn of the millennium. However, the timestamps in Input occur at random intervals, whereas the timestamps in Output occur at fixed intervals. For simplicity, I have ensured that the first observation in Input always occurs before the first observation in Output. Feel free to make this assumption in any answers.

I am currently solving the problem as follows:

    sMax = size(Output, 1);
    tMax = size(Input, 1);
    s = 1;
    t = 2;
    %#Loop over input data
    while t <= tMax
        if Input(t, 1) > Output(s, 1)
            %#If current obs in Input occurs after current obs in Output,
            %#then set current obs in Output equal to previous obs in Input
            Output(s, 2:end) = Input(t-1, 2:end);
            s = s + 1;
            %#Check if we've filled out all observations in Output
            if s > sMax
                break
            end
            %#This step is necessary in case we need to use the same
            %#Input observation twice in a row
            t = t - 1;
        end
        t = t + 1;
        if t > tMax
            %#If all remaining observations in Output occur after the last
            %#observation in Input, then use the last obs in Input for all of them
            Output(s:end, 2:end) = Input(end, 2:end);
            break
        end
    end

Surely there is a more efficient, or at least more elegant, way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has a built-in function that I am not aware of? Any help would be greatly appreciated, as I use this procedure a LOT on some large datasets.

THE ANSWERS: Hi all, I have analyzed the three answers, and as they stand, Angainor's is the best.

ChthonicDaemon's answer, although the easiest to implement, is very slow. This is true even when the conversion to a timeseries object is performed outside the speed test. I assume the resample function carries a lot of overhead at the moment. I am running 2011b, so it is possible that Mathworks has improved it since. In addition, this method needs an extra line for the case where Output ends more than one observation after Input.

Rody's answer runs only slightly slower than Angainor's (unsurprising, given that both use the histc approach); however, it seems to have some problems. First, the method of assigning the last observation in Output is not robust to the last observation in Input occurring after the last observation in Output. This is an easy fix. But there is a second problem, which I think stems from using InputTimeStamp as the first input to histc instead of the OutputTimeStamp used by Angainor. The problem emerges if you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)'; to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)'; when setting up the example inputs.

Angainor's answer seems robust to everything I threw at it, and it was also the fastest.

I ran many speed tests for different input specifications. The following numbers are fairly representative:

My naive loop: Elapsed time is 8.579535 seconds.

Angainor: Elapsed time is 0.661756 seconds.

Rody: Elapsed time is 0.913304 seconds.

ChthonicDaemon: Elapsed time is 22.916844 seconds.

I am +1-ing Angainor's solution and marking the question as answered.

2 answers

Here is my take on the problem. histc is the way to go:

    % find Output timestamps in Input bins
    N = histc(Output(:,1), Input(:,1));

    % find counts in the non-empty bins
    counts = N(find(N));

    % find Input signal value associated with every bin
    val = Input(find(N),2);

    % now, replicate every entry in val
    % as many times as specified in counts
    index = zeros(1,sum(counts));
    index(cumsum([1 counts(1:end-1)'])) = 1;
    index = cumsum(index);
    val_rep = val(index);

    % finish the signal with last entry from Input, as needed
    val_rep(end+1:size(Output,1)) = Input(end,2);

    % done
    Output(:,2) = val_rep;

I checked this against your procedure for several different input models (I changed the number of Output timestamps), and the results were the same. However, I am still not sure I understood your problem, so if something is wrong, let me know.
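A more compact variant of the same histc idea can be sketched as follows. It is not part of the answer above: it relies on the second output of histc (the bin index of each element) and on appending Inf as a final edge, and it assumes, as the question allows, that the first observation in Input occurs before the first observation in Output. The example data is regenerated inline, with the output grid built by scaling an integer range, so that the sketch runs on its own.

```matlab
% Example data, built along the lines of the question
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];
OutputTimeStamp = 730486.002 + 0.001 * (0:TOutput-1)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];

% Appending Inf as a final edge means every Output time at or after the
% last Input time falls into the last real bin instead of being dropped
edges = [Input(:, 1); Inf];

% bin(s) = k means edges(k) <= Output(s, 1) < edges(k+1), i.e. Input row k
% is the most recent observation at or before Output time s
[~, bin] = histc(Output(:, 1), edges);

% Assumption: no Output time precedes the first Input time (bin would be 0)
Output(:, 2) = Input(bin, 2);
```

Because the bin index is used directly, no replication step is needed, and the tail of Output is handled by the Inf edge rather than by a separate assignment.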


This "stale data" approach is known as a zero-order hold in the signal-processing and time-series fields. Searching for that term quickly brings up a multitude of solutions. If you have Matlab 2012b, all of this is built into the timeseries class through the resample function, so you can simply do

    TInput = 100;
    TOutput = 50;
    InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
    InputData = randn(TInput, 1);
    InputTimeSeries = timeseries(InputData, InputTimeStamp);
    OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001);
    OutputTimeSeries = resample(InputTimeSeries, OutputTimeStamp, 'zoh'); % zoh stands for zero order hold
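
For readers who prefer not to go through the timeseries class, a zero-order hold can also be sketched with interp1 and its 'previous' method. This is a swapped-in alternative rather than the resample call above, and it assumes a release whose interp1 supports 'previous'. Query points after the last input time come back as NaN by default, so the sketch fills them explicitly; OutputData and tail are names introduced here for illustration.

```matlab
% Example data, built along the lines of the question
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
OutputTimeStamp = 730486.002 + 0.001 * (0:TOutput-1)';

% Zero-order hold: each query point takes the value of the input sample
% at or immediately before it
OutputData = interp1(InputTimeStamp, InputData, OutputTimeStamp, 'previous');

% Query points past the last input time lie outside the sampled range,
% so carry the final input value forward for them
tail = OutputTimeStamp > InputTimeStamp(end);
OutputData(tail) = InputData(end);
```

The explicit tail assignment plays the same role as the final padding step in the loop and histc solutions.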

Source: https://habr.com/ru/post/926806/