Most efficient vectorization when using the previously calculated value

After these conversations:

I wanted to test a more "real" case study. I recently had to migrate SAS code and kdb code to R, and I put together a simple but somewhat sophisticated example for optimization.

First, create a test data set:

    buildDF <- function(N) {
      set.seed(123)
      dateTimes <- sort(as.POSIXct("2001-01-01 08:30:00") + floor(3600 * runif(N)))
      set.seed(124)
      f <- floor(1 + 3 * runif(N))
      set.seed(123)
      s <- floor(1 + 3 * runif(N))
      return(data.frame(dateTime = dateTimes, f = f, s = s))
    }

Here is what needs to be computed:

    f1 <- function(DF) {
      # initialize
      N <- nrow(DF)
      DF$num[1] <- 1
      for (i in 2:N) {
        if (DF$f[i] == 2) {
          DF$num[i] <- ifelse(DF$s[i-1] == DF$s[i], DF$num[i-1], 1 + DF$num[i-1])
        } else { # meaning f in {1, 3}
          if (DF$f[i-1] != 2) {
            DF$num[i] <- DF$num[i-1]
          } else {
            DF$num[i] <- ifelse((DF$dateTime[i] - DF$dateTime[i-1]) == 0,
                                DF$num[i-1], 1 + DF$num[i-1])
          }
        }
      }
      return(DF)
    }
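To make the rule concrete, here is a minimal sketch (not from the original post) that traces the counter on a hand-built 5-row toy set, assuming `f1` from above is defined. `num` steps up when `f == 2` and `s` changes, or when `f` leaves 2 and the timestamp changes; otherwise it carries the previous value.

```r
# Toy data: the counter should read 1, 1, 2, 3, 3 by the rule above.
toy <- data.frame(
  dateTime = as.POSIXct("2001-01-01 08:30:00") + c(0, 10, 10, 20, 30),
  f = c(1, 2, 2, 1, 1),
  s = c(1, 1, 2, 2, 2)
)
toy <- f1(toy)
toy$num  # 1 1 2 3 3
```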

It's ugly. Let's vectorize it a bit:

    f2 <- function(DF) {
      N <- nrow(DF)
      DF$add <- 1
      DF$ds <- c(NA, diff(DF$s))
      DF$lf <- c(NA, DF$f[1:(N-1)])
      DF$dt <- c(NA, diff(DF$dateTime))
      DF$add[DF$f == 2 & DF$ds == 0] <- 0
      DF$add[DF$f == 2 & DF$ds != 0] <- 1
      DF$add[DF$f != 2 & DF$lf != 2] <- 0
      DF$add[DF$f != 2 & DF$lf == 2 & DF$dt == 0] <- 0
      DF$num <- cumsum(DF$add)
      return(DF)
    }
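The core trick in `f2` is replacing a "carry or increment" recurrence with `cumsum()` over a 0/1 increment vector. A minimal sketch of that idea in isolation (not from the original post):

```r
# 1 where the counter should step, 0 where it carries the previous value;
# the running counter is then just the cumulative sum of the flags.
add <- c(1, 0, 1, 1, 0)
cumsum(add)  # 1 1 2 3 3
```

This works because each element of the result depends on the previous one only through addition, which `cumsum` computes in a single vectorized pass.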

And using the ever-useful data.table:

    f3 <- function(DT) {
      N <- nrow(DT)
      DT[, add := 1]
      DT[, ds := c(NA, diff(s))]
      DT[, lf := c(NA, f[1:(N-1)])]
      DT[, dt := c(NA, diff(dateTime))]
      DT[f == 2 & ds == 0, add := 0]
      DT[f == 2 & ds != 0, add := 1]
      DT[f != 2 & lf != 2, add := 0]
      DT[f != 2 & lf == 2 & dt == 0, add := 0]
      DT[, num := cumsum(add)]
      return(DT)
    }
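As a side note, the manual `c(NA, ...)` lags can be written with data.table's `shift()` (available in data.table >= 1.9.6), which is a sketch of an alternative rather than the original answer's code; the redundant `f == 2 & ds != 0` assignment (add is already 1 there) is dropped:

```r
library(data.table)

# Same logic as f3, but lagging columns with shift() instead of c(NA, ...).
f3b <- function(DT) {
  DT[, ds := s - shift(s)]
  DT[, lf := shift(f)]
  DT[, dt := as.numeric(dateTime - shift(dateTime))]
  DT[, add := 1]
  DT[f == 2 & ds == 0, add := 0]
  DT[f != 2 & lf != 2, add := 0]
  DT[f != 2 & lf == 2 & dt == 0, add := 0]
  DT[, num := cumsum(add)]
  DT
}
```

Rows where the lag is `NA` (the first row) are simply not matched by the filters, so `add` keeps its default of 1 there, as in `f3`.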

On a 10K-row data frame:

    library(rbenchmark)
    library(data.table)

    N <- 1e4
    DF <- buildDF(N)
    DT <- as.data.table(DF)  # we could build the data.table directly, the same
                             # way as the data.frame, so it's OK not to count
                             # this conversion time

    # make sure everybody is equal
    DF1 <- f1(DF); DF2 <- f2(DF); DT3 <- f3(DT)
    identical(DF1$num, DF2$num) && identical(DF2$num, DT3$num)
    [1] TRUE

    # let's benchmark
    benchmark(f1(DF), f2(DF), f3(DT),
              columns = c("test", "replications", "elapsed",
                          "relative", "user.self", "sys.self"),
              order = "relative", replications = 1)
        test replications elapsed relative user.self sys.self
    2 f2(DF)            1   0.010      1.0     0.012    0.000
    3 f3(DT)            1   0.012      1.2     0.012    0.000
    1 f1(DF)            1   9.085    908.5     8.980    0.072

OK, now on a more decent 5M-row data.frame:

    N <- 5e6
    DF <- buildDF(N)
    DT <- as.data.table(DF)
    benchmark(f2(DF), f3(DT),
              columns = c("test", "replications", "elapsed",
                          "relative", "user.self", "sys.self"),
              order = "relative", replications = 1)
        test replications elapsed relative user.self sys.self
    2 f3(DT)            1   2.843    1.000     2.092    0.624
    1 f2(DF)            1  10.920    3.841     4.016    5.137

We get roughly a 4X speedup with data.table.

I wonder whether Rcpp or zoo::rollapply can win by much more. I would appreciate any suggestions.

2 answers

A simple inline Rcpp version:

    library(Rcpp)
    library(inline)

    f4cxx <- cxxfunction(signature(input = "data.frame"), plugin = "Rcpp", body = '
      Rcpp::DataFrame df(input);
      const int N = df.nrows();
      Rcpp::NumericVector f = df["f"];
      Rcpp::NumericVector s = df["s"];
      Rcpp::NumericVector d = df["dateTime"]; // since we only need comparisons,
                                              // it is safe (and faster) to treat
                                              // Datetime as numeric
      Rcpp::NumericVector num(N);
      num[0] = 1;
      for (int i = 1; i < N; i++) {
        bool cond1 = (f[i] == 2) && (s[i] != s[i-1]);
        bool cond2 = (f[i] != 2) && (f[i-1] == 2) && (d[i] != d[i-1]);
        num[i] = (cond1 || cond2) ? 1 + num[i-1] : num[i-1];
      }
      df["num"] = num;
      return df;                                  // returns a list
      // return (Rcpp::as<Rcpp::DataFrame>(df));  // returns a data.frame (slower)
    ')

Verification:

    N <- 1e4
    df <- buildDF(N)
    identical(f1(df)$num, f4cxx(df)$num)
    [1] TRUE

Benchmarking (f2j is the base-R function from the other answer below):

    N <- 1e5
    df <- buildDF(N)
    dt <- as.data.table(df)
    benchmark(f2(df), f2j(df), f3(dt), f4cxx(df),
              columns = c("test", "replications", "elapsed",
                          "relative", "user.self", "sys.self"),
              order = "relative", replications = 1)
           test replications elapsed relative user.self sys.self
    4 f4cxx(df)            1   0.001        1     0.000        0
    2   f2j(df)            1   0.037       37     0.040        0
    3    f3(dt)            1   0.058       58     0.056        0
    1    f2(df)            1   0.078       78     0.076        0

Converting your first function to C/C++ using Rcpp (or the "plain old C API", if you want to make Dirk scratch his head) is likely to be the fastest. The data.table solution is likely a close second.

Here's a base-R solution that is much faster than your f2 function because it avoids repeated data.frame subsetting, which is very slow. This illustrates what to do (and what to avoid) to make base-R code fast, though at some cost to code readability.

    f2j <- function(DF) {
      N <- nrow(DF)
      f2 <- DF$f == 2
      ds0 <- diff(DF$s) == 0
      lf2 <- f2[-N]
      f2 <- f2[-1]
      dt3 <- diff(DF$dateTime) == 0
      cond <- logical(N)
      cond[-1] <- (f2 & ds0) | (!f2 & !lf2) | (!f2 & lf2 & dt3)
      DF$num <- cumsum(!cond)
      DF
    }
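The cost being avoided here is worth seeing in isolation. A minimal sketch (function names are illustrative, not from the answer): indexing `DF$x[i]` inside a loop re-extracts the column on every iteration, while pulling the column out once gives plain, fast vector indexing.

```r
# Each DF$x[i] re-extracts the column from the data.frame every iteration.
slow_sum <- function(DF) {
  s <- 0
  for (i in seq_len(nrow(DF))) s <- s + DF$x[i]
  s
}

# One extraction up front, then plain vector indexing in the loop.
fast_sum <- function(DF) {
  x <- DF$x
  s <- 0
  for (i in seq_along(x)) s <- s + x[i]
  s
}

DF <- data.frame(x = runif(1e5))
system.time(slow_sum(DF))  # noticeably slower
system.time(fast_sum(DF))  # much faster, same result
```

Timings are machine-dependent, but the gap grows with the number of iterations; f2j applies the same principle by computing all the lagged comparisons on extracted vectors.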
