In my research work, I usually deal with large 4D arrays (20-200 million elements). I am trying to improve the computational speed of my calculations by looking for the best compromise between speed and simplicity. I have already taken some steps forward thanks to SO (see here and here).
Now I am trying to go further by using newer packages like data.table and plyr.
Let's start with something like:
D = c(100, 1000, 8) # x, y, t
d = array(rnorm(prod(D)), dim = D)
For each x (first dimension) and y (second dimension), I would like to flag the t values that are above the 90th percentile. Let's do it with base R:
system.time(
  q1 <- apply(d, c(1, 2), function(x) {
    return(x >= quantile(x, .9, names = F))
  })
)
On my MacBook, this takes about ten seconds, and it returns the array as:
> dim(q1)
[1]    8  100 1000
(apply reorders the dimensions in a somewhat strange way, putting t first; I don't really care about that here, but a quick fix is sketched below.)
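Just as a side note (not part of the timing), a minimal sketch of how the original x, y, t order could be restored with base aperm:

# put the dimensions back in x, y, t order (apply() moved t to the front)
q1 <- aperm(q1, c(2, 3, 1))   # 100 x 1000 x 8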
Now I can melt my array (reshape2 package) and use it in data.table:

> d_m = melt(d)
> colnames(d_m) = c('x', 'y', 't', 'value')
> d_t = data.table(d_m)
Then I do some data.table "magic":
system.time({
  q2 = d_t[, q := quantile(value, .9, names = F), by = "x,y"][, ev := value > q]
})
The calculation now takes a little less than ten seconds. Next, let's try plyr and ddply:
system.time({
  q3 <- ddply(d_m, .(x, y), summarise, q = quantile(value, .9, names = F))
})
This takes 60 seconds. If I switch to dplyr, I can do the same calculation in about ten seconds.
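I haven't shown that call above, but the dplyr version I have in mind is roughly the following sketch (group_by + mutate mirroring the data.table call; q4 is just a placeholder name):

library(dplyr)
q4 <- d_m %>%
  group_by(x, y) %>%
  mutate(q  = quantile(value, .9, names = FALSE),  # per-group 90th percentile
         ev = value > q)                           # flag values above it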
However, my question is this: what would you do to make this calculation faster? If I consider a larger array (say, 20 times larger), data.table is faster than the apply approach, but still of the same order of magnitude (14 minutes versus 10 minutes). Any comment is really appreciated...
EDIT
I implemented the quantile function in C++ using Rcpp, which speeds up the calculation about eight times.
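I'm not posting my exact code, but the idea is roughly the following sketch (a single-probability quantile using the type-7 interpolation rule, R's default; the name quantile7 is just a placeholder):

library(Rcpp)
cppFunction('
double quantile7(NumericVector x, double p) {
  NumericVector y = clone(x).sort();   // sort a copy, leave the input untouched
  int n = y.size();
  double h = (n - 1) * p;              // fractional index (type-7 rule)
  int lo = (int) h;
  double g = h - lo;                   // interpolation weight
  if (g == 0.0) return y[lo];
  return (1 - g) * y[lo] + g * y[lo + 1];
}
')

Then it drops into the original apply call:

system.time(
  q5 <- apply(d, c(1, 2), function(v) v >= quantile7(v, .9))
)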