In my research work, I usually deal with large 4D arrays (20-200 million elements). I am trying to improve the computational speed of my calculations by looking for the best compromise between speed and simplicity. I have already taken some steps forward thanks to SO (see here and here).
Now I am trying to go further by using newer packages like data.table and plyr.
Let's start with something like:
D = c(100, 1000, 8) # x, y, t
d = array(rnorm(prod(D)), dim = D)
For each x (first dimension) and y (second dimension), I would like to flag the t values that are above the 90th percentile. Let's do it with base R:
system.time(
  q1 <- apply(d, c(1, 2), function(x) {
    return(x >= quantile(x, .9, names = F))
  })
)
On my MacBook, this takes about ten seconds, and it returns the array as:
> dim(q1)
[1]    8  100 1000
(apply reorders the dimensions in a somewhat strange way, putting t first; I don't really care about that here, but a quick fix is sketched below.)
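Just as a side note (not part of the timing), a minimal sketch of how the original x, y, t order could be restored with base aperm:

# put the dimensions back in x, y, t order (apply() moved t to the front)
q1 <- aperm(q1, c(2, 3, 1))   # 100 x 1000 x 8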
Now I can melt my array (reshape2 package) and use it in data.table:

> d_m = melt(d)
> colnames(d_m) = c('x', 'y', 't', 'value')
> d_t = data.table(d_m)
Then I do some data.table "magic":
system.time({
  q2 = d_t[, q := quantile(value, .9, names = F), by = "x,y"][, ev := value > q]
})
The calculation now takes a little less than ten seconds. Next, let's try plyr and ddply:
system.time({
  q3 <- ddply(d_m, .(x, y), summarise, q = quantile(value, .9, names = F))
})
This takes 60 seconds. If I switch to dplyr, I can do the same calculation in about ten seconds.
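I haven't shown that call above, but the dplyr version I have in mind is roughly the following sketch (group_by + mutate mirroring the data.table call; q4 is just a placeholder name):

library(dplyr)
q4 <- d_m %>%
  group_by(x, y) %>%
  mutate(q  = quantile(value, .9, names = FALSE),  # per-group 90th percentile
         ev = value > q)                           # flag values above it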
However, my question is this: what would you do to make this calculation faster? If I consider a larger array (say, 20 times larger), data.table is faster than the apply approach, but still of the same order of magnitude (14 minutes versus 10 minutes). Any comment is really appreciated...
EDIT
I implemented the quantile function in C++ using Rcpp, which speeds up the calculation about eight times.
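I'm not posting my exact code, but the idea is roughly the following sketch (a single-probability quantile using the type-7 interpolation rule, R's default; the name quantile7 is just a placeholder):

library(Rcpp)
cppFunction('
double quantile7(NumericVector x, double p) {
  NumericVector y = clone(x).sort();   // sort a copy, leave the input untouched
  int n = y.size();
  double h = (n - 1) * p;              // fractional index (type-7 rule)
  int lo = (int) h;
  double g = h - lo;                   // interpolation weight
  if (g == 0.0) return y[lo];
  return (1 - g) * y[lo] + g * y[lo + 1];
}
')

Then it drops into the original apply call:

system.time(
  q5 <- apply(d, c(1, 2), function(v) v >= quantile7(v, .9))
)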