R: Efficiently locating time series segments with maximum cross-correlation to an input segment?

Question


I have a long numeric time series of approximately 200,000 rows (let's call it Z).

In a loop, I subset x (about 30) consecutive rows from Z at a time and treat them as a query segment q.

I want to locate within Z the y (~300) segments of length x that are most highly correlated with q.

What is an efficient way to accomplish this?

+8

r time-series subset correlation

Mike Furlender Feb 02 '12 at 5:56


2 answers

The naive solution is indeed very slow (at least several minutes; I was not patient enough to let it finish):

    library(zoo)
    n <- 2e5
    k <- 30
    z <- rnorm(n)
    x <- rnorm(k)  # We do not use the fact that x is a part of z
    rollapply(z, k, function(u) cor(u,x), align="left")

You can compute the correlation by hand, from the first moments and co-moments, but it still takes several minutes.

    y <- zoo(rnorm(n), 1:n)
    x <- rnorm(k)
    exy <- exx <- eyy <- ex <- ey <- zoo( rep(0,n), 1:n )
    for(i in 1:k) {
      cat(i, "\n")
      exy <- exy + lag(y,i-1) * x[i]
      ey  <- ey  + lag(y,i-1)
      eyy <- eyy + lag(y,i-1)^2
      ex  <- ex  + x[i]    # Constant time series
      exx <- exx + x[i]^2  # Constant time series
    }
    exy <- exy/k
    ex  <- ex/k
    ey  <- ey/k
    exx <- exx/k
    eyy <- eyy/k
    covxy <- exy - ex * ey
    vx    <- exx - ex^2
    vy    <- eyy - ey^2
    corxy <- covxy / sqrt( vx * vy )
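The same moment-based calculation can also be sketched with plain numeric vectors instead of zoo objects, which avoids the overhead of aligning time indices on every step. This is only an illustration, not part of the original answer: the function name `roll_cor` is hypothetical, and the rolling sums are obtained from cumulative sums, which can lose precision on series with a very large mean.

```r
# Hypothetical helper: correlation of a query x (length k) with every
# length-k window of z, computed from first moments and co-moments.
roll_cor <- function(z, x) {
  n <- length(z)
  k <- length(x)
  m <- n - k + 1                        # number of windows

  # Rolling sums of z and z^2 via differences of cumulative sums
  cs  <- c(0, cumsum(z))
  cs2 <- c(0, cumsum(z^2))
  sz  <- cs[(k + 1):(n + 1)]  - cs[1:m]
  szz <- cs2[(k + 1):(n + 1)] - cs2[1:m]

  # Cross term: sum over i of x[i] * z[start + i - 1], one offset at a time
  sxz <- numeric(m)
  for (i in seq_len(k)) {
    sxz <- sxz + x[i] * z[i:(i + m - 1)]
  }

  cov_xz <- sxz / k - mean(x) * sz / k
  var_x  <- mean(x^2) - mean(x)^2
  var_z  <- szz / k - (sz / k)^2
  cov_xz / sqrt(var_x * var_z)
}

# Small usage example on simulated data
set.seed(1)
z_demo <- rnorm(200)
x_demo <- rnorm(10)
r_demo <- roll_cor(z_demo, x_demo)
```

Element `s` of the result is the correlation between `x` and the window of `z` starting at position `s`, so the output can be passed to `order()` in the same way as `corxy`.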

Once you have the time series of correlations, it is easy to extract the positions of the top 300.

    i <- order(corxy, decreasing=TRUE)[1:300]
    corxy[i]

+3

Vincent Zoonekynd Feb 02 '12 at 6:37


Josh O'Brien · Accepted Answer · 2012-02-05T00:49:54+0000

The code below finds the 300 segments you are looking for and runs in about 8 seconds on my none-too-powerful Windows laptop, so it should be fast enough for your purposes.

First, it constructs a 30-by-199971 matrix ( Zmat ), whose columns contain all of the length-30 time series segments you want to examine. A single call to cor() , operating on the vector q and the matrix Zmat , then calculates all of the required correlation coefficients. Finally, the resulting vector is examined to identify the 300 segments having the highest correlation coefficients.

    # Simulate data
    nZ <- 200000
    nq <- 30
    Z <- rnorm(nZ)
    q <- seq_len(nq)

    # From Z, construct a 30 by 199971 matrix, in which each column is a
    # "time series segment". Column 1 contains observations 1:30, column 2
    # contains observations 2:31, and so on through the end of the series.
    Zmat <- sapply(seq_len(nZ - nq + 1),
                   FUN = function(X) Z[seq(from = X, length.out = nq)])

    # Calculate the correlation of q with every column/"time series segment".
    Cors <- cor(q, Zmat)

    # Extract the starting position of the 300 most highly correlated segments
    ids <- order(Cors, decreasing=TRUE)[1:300]

    # Maybe try something like the following to confirm that you have
    # selected the most highly correlated segments.
    hist(Cors, breaks=100)
    hist(Cors[ids], col="red", add=TRUE)
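As a possible variation on the matrix-building step above, base R's `embed()` can produce the same sliding windows without an explicit `sapply()` loop. The sketch below is not from the original answer, and the helper name `make_windows` is hypothetical. `embed()` returns one window per row with the most recent observation first, so the columns are reversed and the result transposed to match the layout of Zmat.

```r
# Hypothetical helper: build a k-by-(length(z) - k + 1) matrix whose
# column s holds z[s], z[s + 1], ..., z[s + k - 1].
make_windows <- function(z, k) {
  w <- embed(z, k)                  # rows are windows, newest value first
  t(w[, k:1, drop = FALSE])         # restore time order, windows as columns
}

# Small usage example on simulated data
set.seed(42)
Z_demo  <- rnorm(100)
nq_demo <- 5
q_demo  <- seq_len(nq_demo)

W_demo    <- make_windows(Z_demo, nq_demo)
cors_demo <- cor(q_demo, W_demo)    # 1-by-96 matrix of correlations
top_demo  <- order(cors_demo, decreasing = TRUE)[1:10]
```

Either way, the full-size matrix holds 30 x 199971 doubles, roughly 48 MB, so memory use grows in proportion to the segment length.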
