Efficient subsetting in R using two data frames

I have a large time series, full, in one data frame and a list of timestamps in a different data frame, test. I need to subset full to the data points surrounding the timestamps in test. My first instinct (being an R noob) was to write the line below, which was wrong:

 subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i)) 

Looking at the result, I realized that R steps through both vectors simultaneously, element by element, giving the wrong result. My other option is to write a loop, as shown below:

    subs <- data.frame()
    for (j in test$dt)
      subs <- rbind(subs, subset(full, full$dt > (j - i) & full$dt < (j + i)))
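The element-by-element behaviour described above can be seen on a toy case. This is a minimal sketch, assuming made-up vectors `full_dt` and `test_dt` (hypothetical names, not from the question) and base R only: the shorter vector is recycled along the longer one, so each element of `full_dt` is compared against just one of the test points.

```r
# Hypothetical toy data, chosen only to make the recycling visible
full_dt <- 1:10
test_dt <- c(3, 8)
i <- 2

# Elementwise comparison: test_dt is recycled as 3, 8, 3, 8, ...
wrong <- full_dt[(full_dt > test_dt - i) & (full_dt < test_dt + i)]

# Intended result: values within i of ANY test point
hits <- lapply(test_dt, function(j) full_dt > (j - i) & full_dt < (j + i))
right <- full_dt[Reduce(`|`, hits)]

print(wrong)  # 3 8
print(right)  # 2 3 4 7 8 9
```

The recycled comparison silently returns a result of plausible shape, which is why the mistake is easy to miss.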

I feel there may be a better way than a loop, and this article implores us to avoid R loops as much as possible. The other reason is that I may run into performance issues, since this will sit at the core of an optimization algorithm. Any suggestions from the gurus would be most appreciated.

EDIT:

Here is some reproducible code that shows the wrong approach, as well as an approach that works, but could be better.

    # create a time series
    full <- data.frame(seq(1:200), rnorm(200, 0, 1))
    colnames(full) <- c("dt", "val")

    # my smaller array of points of interest
    test <- data.frame(seq(5, 200, by = 23))
    colnames(test) <- c("dt")

    # my range around the points of interest
    i <- 3

    # the wrong approach
    subs <- subset(full, (full$dt > test$dt - i) & (full$dt < test$dt + i))

    # this works, but not sure this is the best way to go about it
    subs <- data.frame()
    for (j in test$dt)
      subs <- rbind(subs, subset(full, full$dt > (j - i) & full$dt < (j + i)))

EDIT: I updated the values to better reflect my use case, and I see that @mrdwab's solution pulls ahead unexpectedly and by a wide margin.

I am using the benchmark code from @mrdwab, and the initialization is as follows:

    set.seed(1)
    full <- data.frame(
      dt = 1:15000000,
      val = floor(rnorm(15000000, 0, 1))
    )
    test <- data.frame(dt = floor(runif(24, 1, 15000000)))
    i <- 500

Benchmarks:

           test replications elapsed relative
    2    mrdwab            2    1.31  1.00000
    3 spacedman            2   69.06 52.71756
    1    andrie            2   93.68 71.51145
    4  original            2  114.24 87.20611

Totally unexpected. Mind = blown. Can someone shed some light on this dark corner and explain what is happening?

Important: As @mrdwab notes below, his solution only works if the vectors are integers. If not, @spacedman has the right solution.
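The integer restriction can be illustrated with a small sketch. The data below (`full` with half-step timestamps, a single point `centre`) are hypothetical and only meant to show the effect: an integer sequence fed to `%in%` cannot match timestamps that fall between whole numbers, whereas the range comparison keeps them.

```r
# Hypothetical data with non-integer timestamps
full <- data.frame(dt = seq(1, 10, by = 0.5))
centre <- 5   # a single, made-up point of interest
i <- 2

# Range comparison: keeps every dt strictly between centre - i and centre + i
by_range <- full$dt[full$dt > (centre - i) & full$dt < (centre + i)]

# %in% against an integer sequence: only whole numbers can match
wanted <- (centre - (i - 1)):(centre + (i - 1))
by_in <- full$dt[full$dt %in% wanted]

print(by_range)  # 3.5 4.0 4.5 5.0 5.5 6.0 6.5
print(by_in)     # 4 5 6
```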

4 answers

I don't know if it's any more efficient, but I would think you could also do something like this to get what you want:

    subs <- apply(test, 1, function(x) c((x-2):(x+2)))
    full[which(full$dt %in% subs), ]

I had to change the "3" to "2" because the comparisons in the question are strict (> and <), so for integer timestamps the points at exactly x-3 and x+3 are excluded.

Benchmarking (just for fun)

@Spacedman leads!

First, the required data and functions.

    ## Data
    set.seed(1)
    full <- data.frame(
      dt = 1:200,
      val = rnorm(200, 0, 1)
    )
    test <- data.frame(dt = seq(5, 200, by = 23))
    i <- 3

    ## Spacedman functions
    cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
    OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
    funs = mapply(cf, test$dt-i, test$dt+i)
    anyF = Reduce(OR, funs)

Second, the benchmarking.

    ## Benchmarking
    require(rbenchmark)
    benchmark(
      andrie = do.call(rbind, lapply(test$dt, function(j)
        full[full$dt > (j - i) & full$dt < (j + i), ])),
      mrdwab = {
        subs <- apply(test, 1, function(x) c((x - (i - 1)):(x + (i - 1))))
        full[which(full$dt %in% subs), ]
      },
      spacedman = full[anyF(full$dt), ],
      original = {
        subs <- data.frame()
        for (j in test$dt)
          subs <- rbind(subs, subset(full, full$dt > (j - i) & full$dt < (j + i)))
      },
      columns = c("test", "replications", "elapsed", "relative"),
      order = "relative")
    #        test replications elapsed  relative
    # 3 spacedman          100   0.064  1.000000
    # 2    mrdwab          100   0.105  1.640625
    # 1    andrie          100   0.520  8.125000
    # 4  original          100   1.080 16.875000
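Before reading too much into timings, it may be worth confirming that the approaches select the same rows. The following is a minimal sketch using the small example data from above; the variable names `original`, `wanted` and `by_in` are introduced here for illustration, and only the loop and `%in%` versions are compared.

```r
set.seed(1)
full <- data.frame(dt = 1:200, val = rnorm(200, 0, 1))
test <- data.frame(dt = seq(5, 200, by = 23))
i <- 3

# Loop version from the question
original <- data.frame()
for (j in test$dt)
  original <- rbind(original, subset(full, full$dt > (j - i) & full$dt < (j + i)))

# %in% version (valid for integer timestamps only)
wanted <- unlist(lapply(test$dt, function(x) (x - (i - 1)):(x + (i - 1))))
by_in <- full[full$dt %in% wanted, ]

# Both should pick the same 9 * 5 = 45 rows
same_rows <- nrow(original) == nrow(by_in) && all(original$dt == by_in$dt)
print(same_rows)
```

Note that the row order agrees here only because the test points are sorted and their windows do not overlap; with overlapping windows the loop version would return duplicated rows while the `%in%` version would not.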

Here is the real R way to do this. Functionally. No loops...

Starting with Andrie's example data.

Firstly, the interval comparison function:

 > cf = function(l,u){force(l);force(u);function(x){x>l & x<u}} 

Next, an OR composition function:

 > OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}} 

Now there is a loop of sorts here, to build a list of these comparison functions:

 > funs = mapply(cf,test$dt-i,test$dt+i) 

Now combine all this into one function:

 > anyF = Reduce(OR,funs) 

And now we apply the OR composition to our interval test functions:

    > head(full[anyF(full$dt),])
       dt         val
    3   3 -0.83562861
    4   4  1.59528080
    5   5  0.32950777
    6   6 -0.82046838
    7   7  0.48742905
    26 26 -0.05612874

You now have a single variable function that checks if a value is in the ranges you specify.

    > anyF(1:10)
     [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
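One way to read `Reduce(OR, funs)` is as a chain of nested ORs over the individual interval tests. The sketch below rebuilds the functions from this answer and checks the composed predicate against an explicit version that evaluates every interval test and combines the results with `|`. The names `test_dt`, `x` and `by_hand` are introduced here for illustration only.

```r
cf <- function(l, u) { force(l); force(u); function(x) x > l & x < u }
OR <- function(f1, f2) { force(f1); force(f2); function(x) f1(x) | f2(x) }

test_dt <- seq(5, 200, by = 23)
i <- 3
funs <- mapply(cf, test_dt - i, test_dt + i)   # a list of 9 interval tests
anyF <- Reduce(OR, funs)                       # OR(OR(OR(f1, f2), f3), ...)

# Explicit equivalent: run each test on x, then OR the logical vectors together
x <- 1:200
by_hand <- Reduce(`|`, lapply(funs, function(f) f(x)))

print(identical(anyF(x), by_hand))
```

Either way, every interval test is evaluated over the whole of `x`, so the work grows with the number of test points times the length of `x`.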

I don't know if it is faster, or better, or what. Someone do some benchmarks!


There is nothing wrong with your code. To achieve your goal, you need a loop of some sort around a vectorised subset operation.

But here is a more R-ish way to do it, which might well be faster:

    do.call(rbind,
      lapply(test$dt, function(j) full[full$dt > (j - i) & full$dt < (j + i), ])
    )

PS: You can greatly simplify your reproducible example:

    set.seed(1)
    full <- data.frame(
      dt = 1:200,
      val = rnorm(200, 0, 1)
    )
    test <- data.frame(dt = seq(5, 200, by = 23))
    i <- 3

    xx <- do.call(rbind,
      lapply(test$dt, function(j) full[full$dt > (j - i) & full$dt < (j + i), ])
    )

    head(xx)
       dt         val
    3   3 -0.83562861
    4   4  1.59528080
    5   5  0.32950777
    6   6 -0.82046838
    7   7  0.48742905
    26 26 -0.05612874

Another way, using data.table:

    library(data.table)
    {
      # keyed table of every timestamp plus every interval endpoint
      temp <- data.table(x = unique(c(full$dt, (test$dt - i), (test$dt + i))), key = "x")
      temp[, index := 1:nrow(temp)]
      # positions of the interval endpoints in the sorted table
      startpoints <- temp[J(test$dt - i)]$index
      endpoints <- temp[J(test$dt + i)]$index
      allpoints <- as.vector(mapply(FUN = function(x, y) x:y, x = startpoints, y = endpoints))
      setkey(x = temp, index)
      ans <- temp[J(allpoints)]$x
    }

Benchmarks, with 9 rows in test.

Number of rows: 10,000

           test replications elapsed relative
    1 spacedman          100   0.406    1.000
    2       new          100   1.179    2.904

Number of rows: 100,000

           test replications elapsed relative
    2       new          100   2.374    1.000
    1 spacedman          100   3.753    1.581