R: Summation of adjacent matrix elements. How to speed up?

I work with large arrays, about 2500x2500x50 (lon x lat x time) in size. The array contains only 1s and 0s. For each cell I need to know the sum of the 24 surrounding elements. So far I have been doing it along these lines:

    xdim <- 2500
    ydim <- 2500
    tdim <- 50
    a <- array(0:1,dim=c(xdim,ydim,tdim))
    res <- array(0:1,dim=c(xdim,ydim,tdim))

    for (t in 1:tdim){
      for (x in 3:(xdim-2)){
        for (y in 3:(ydim-2)){
          res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
        }
      }
    }

It works, but it is too slow for my needs. Can anyone advise how to speed this up?

3 answers

Here is one quick solution for a large array:

    res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
    dim(res) <- c(xdim, ydim, tdim)

I filtered the array along each dimension using rep(1, 5) as the weights (i.e., summing the values within a neighborhood of 2). Then I reset the dim attribute, since the result initially comes out as a matrix.

Please note that this wraps the sums around the edges of the array (which may make sense since you are working with latitude and longitude; if not, I can change my answer).



For a specific example:

    xdim <- 500
    ydim <- 500
    tdim <- 15
    a <- array(0:1,dim=c(xdim,ydim,tdim))

and here is what you are currently using (with NA at the edges) and how long this example takes on my laptop:

    f1 <- function(a, xdim, ydim, tdim){
      res <- array(NA_integer_,dim=c(xdim,ydim,tdim))
      for (t in 1:tdim){
        for (x in 3:(xdim-2)){
          for (y in 3:(ydim-2)){
            res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
          }
        }
      }
      return(res)
    }
    system.time(res1 <- f1(a, xdim, ydim, tdim))
    #   user  system elapsed
    # 14.813   0.005  14.819

And here is a comparison with the version I described:

    f2 <- function(a, xdim, ydim, tdim){
      res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
      dim(res) <- c(xdim, ydim, tdim)
      return(res)
    }
    system.time(res2 <- f2(a, xdim, ydim, tdim))
    #   user  system elapsed
    #  1.188   0.047   1.236

You can see a significant speedup (for large arrays). And a check that it gives the correct solution (note that I add NAs so that both results match, since the version I gave filters circularly):

    ## Match NAs
    res2NA <- ifelse(is.na(res1), NA, res2)
    all.equal(res2NA, res1)
    # [1] TRUE

I will add that your full array (2500x2500x50) took a little under a minute (about 55 seconds), although it used a lot of memory in the process, FYI.


Introduction

I have to say that so much of this problem is hidden in how the array is constructed; once you see that, the rest is trivial. As a result, there are really two ways to do this:

  • Brute force, as proposed by @Alex (written in C++)
  • Observing the replication pattern

Brute force with OpenMP

If we want to brute force this, we can use @Alex's suggestion of OpenMP with Armadillo:

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]

    // Add a flag to enable OpenMP at compile time
    // [[Rcpp::plugins(openmp)]]

    // Protect against compilers without OpenMP
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    // [[Rcpp::export]]
    arma::cube cube_parallel(arma::cube a, arma::cube res, int cores = 1) {

      // Extract the different dimensions
      unsigned int tdim = res.n_slices;
      unsigned int xdim = res.n_rows;
      unsigned int ydim = res.n_cols;

      // Same calculation loop
      #pragma omp parallel for num_threads(cores)
      for (unsigned int t = 0; t < tdim; t++){

        // pop the T
        arma::mat temp_mat = a.slice(t);

        // Subset the rows
        for (unsigned int x = 2; x < xdim-2; x++){

          arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);

          // Iterate over the columns with unit accumulative sum
          for (unsigned int y = 2; y < ydim-2; y++){
            res(x,y,t) = accu(temp_row_sub.cols(y-2,y+2));
          }
        }
      }

      return res;
    }

Replication patterns

However, a more reasonable approach is to understand how array(0:1, dims) is actually filled.

Most noticeably:

  • Case 1: If xdim is even, then only the rows of a matrix alternate; every slice is identical.
  • Case 2: If xdim is odd and ydim is odd, then the rows alternate and the matrices (slices) alternate as well.
  • Case 3: If xdim is odd and ydim is even, then only the rows alternate; every slice is identical.

Examples

Let's look at each case in action to observe the patterns.

Case 1:

    xdim <- 2
    ydim <- 3
    tdim <- 2
    a <- array(0:1,dim=c(xdim,ydim,tdim))

Output

    , , 1

         [,1] [,2] [,3]
    [1,]    0    0    0
    [2,]    1    1    1

    , , 2

         [,1] [,2] [,3]
    [1,]    0    0    0
    [2,]    1    1    1

Case 2:

    xdim <- 3
    ydim <- 3
    tdim <- 3
    a <- array(0:1,dim=c(xdim,ydim,tdim))

Output:

    , , 1

         [,1] [,2] [,3]
    [1,]    0    1    0
    [2,]    1    0    1
    [3,]    0    1    0

    , , 2

         [,1] [,2] [,3]
    [1,]    1    0    1
    [2,]    0    1    0
    [3,]    1    0    1

    , , 3

         [,1] [,2] [,3]
    [1,]    0    1    0
    [2,]    1    0    1
    [3,]    0    1    0

Case 3:

    xdim <- 3
    ydim <- 4
    tdim <- 2
    a <- array(0:1,dim=c(xdim,ydim,tdim))

Output:

    , , 1

         [,1] [,2] [,3] [,4]
    [1,]    0    1    0    1
    [2,]    1    0    1    0
    [3,]    0    1    0    1

    , , 2

         [,1] [,2] [,3] [,4]
    [1,]    0    1    0    1
    [2,]    1    0    1    0
    [3,]    0    1    0    1

Exploiting the pattern

Alrighty, based on the discussion above, let's write some code that takes advantage of this unique pattern.

Creating alternating vectors

First, we need vectors that alternate between two different values (0 and 1).

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]

    // ------- Make Alternating Vectors

    arma::vec odd_vec(unsigned int xdim){

      // make a temporary vector to create alternating 0-1 effect by row.
      arma::vec temp_vec(xdim);

      // Alternating vector (anyone have a better solution?)
      for (unsigned int i = 0; i < xdim; i++) {
        temp_vec(i) = (i % 2 ? 0 : 1);
      }

      return temp_vec;
    }

    arma::vec even_vec(unsigned int xdim){

      // make a temporary vector to create alternating 0-1 effect by row.
      arma::vec temp_vec(xdim);

      // Alternating vector (anyone have a better solution?)
      for (unsigned int i = 0; i < xdim; i++) {
        temp_vec(i) = (i % 2 ? 1 : 0); // changed
      }

      return temp_vec;
    }

Creating Three Matrix Cases

As mentioned above, there are three matrix cases: the even case and the first and second odd cases.

    // --- Handle the different cases

    // [[Rcpp::export]]
    arma::mat make_even_matrix(unsigned int xdim, unsigned int ydim){

      arma::mat temp_mat(xdim,ydim);

      temp_mat.each_col() = even_vec(xdim);

      return temp_mat;
    }

    // xdim is odd and ydim is even
    // [[Rcpp::export]]
    arma::mat make_odd_matrix_case1(unsigned int xdim, unsigned int ydim){

      arma::mat temp_mat(xdim,ydim);

      arma::vec e_vec = even_vec(xdim);
      arma::vec o_vec = odd_vec(xdim);

      // Alternating column
      for (unsigned int i = 0; i < ydim; i++) {
        temp_mat.col(i) = (i % 2 ? o_vec : e_vec);
      }

      return temp_mat;
    }

    // xdim is odd and ydim is odd
    // [[Rcpp::export]]
    arma::mat make_odd_matrix_case2(unsigned int xdim, unsigned int ydim){

      arma::mat temp_mat(xdim,ydim);

      arma::vec e_vec = even_vec(xdim);
      arma::vec o_vec = odd_vec(xdim);

      // Alternating column
      for (unsigned int i = 0; i < ydim; i++) {
        temp_mat.col(i) = (i % 2 ? e_vec : o_vec); // slight change
      }

      return temp_mat;
    }

Calculation engine

Same as the previous solution, but without the t loop, since we no longer need to repeat the calculation for every slice.

    // --- Calculation engine

    // [[Rcpp::export]]
    arma::mat calc_matrix(arma::mat temp_mat){

      unsigned int xdim = temp_mat.n_rows;
      unsigned int ydim = temp_mat.n_cols;

      arma::mat res = temp_mat;

      // Subset the rows
      for (unsigned int x = 2; x < xdim-2; x++){

        arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);

        // Iterate over the columns with unit accumulative sum
        for (unsigned int y = 2; y < ydim-2; y++){
          res(x,y) = accu(temp_row_sub.cols(y-2,y+2));
        }
      }

      return res;
    }

Main call function

Here is the main function that brings everything together. This gives us the desired array.

    // --- Main Engine

    // Create the desired cube information
    // [[Rcpp::export]]
    arma::cube dim_to_cube(unsigned int xdim = 4, unsigned int ydim = 4, unsigned int tdim = 3) {

      // Initialize values in A
      arma::cube res(xdim,ydim,tdim);

      if(xdim % 2 == 0){
        res.each_slice() = calc_matrix(make_even_matrix(xdim, ydim));
      }else{

        if(ydim % 2 == 0){
          res.each_slice() = calc_matrix(make_odd_matrix_case1(xdim, ydim));
        }else{

          arma::mat first_odd_mat = calc_matrix(make_odd_matrix_case1(xdim, ydim));
          arma::mat sec_odd_mat = calc_matrix(make_odd_matrix_case2(xdim, ydim));

          for(unsigned int t = 0; t < tdim; t++){
            res.slice(t) = (t % 2 ? sec_odd_mat : first_odd_mat);
          }
        }
      }

      return res;
    }

Timing

Now, how well does this actually perform:

    Unit: microseconds
           expr      min        lq       mean    median        uq       max neval
        r_1core 3538.022 3825.8105 4301.84107 3957.3765 4043.0085 16856.865   100
     alex_1core 2790.515 2984.7180 3461.11021 3076.9265 3189.7890 15371.406   100
      cpp_1core  174.508  180.7190  197.29728  194.1480  204.8875   338.510   100
      cpp_2core  111.960  116.0040  126.34508  122.7375  136.2285   162.279   100
      cpp_3core   81.619   88.4485  104.54602   94.8735  108.5515   204.979   100
      cpp_cache   40.637   44.3440   55.08915   52.1030   60.2290   302.306   100

The script used for the timings:

    cpp_parallel = cube_parallel(a,res, 1)
    alex_1core = alex(a,res,xdim,ydim,tdim)
    cpp_cache = dim_to_cube(xdim,ydim,tdim)
    op_answer = cube_r(a,res,xdim,ydim,tdim)

    all.equal(cpp_parallel, op_answer)
    all.equal(cpp_cache, op_answer)
    all.equal(alex_1core, op_answer)

    xdim <- 20
    ydim <- 20
    tdim <- 5
    a <- array(0:1,dim=c(xdim,ydim,tdim))
    res <- array(0:1,dim=c(xdim,ydim,tdim))

    ga = microbenchmark::microbenchmark(r_1core = cube_r(a,res,xdim,ydim,tdim),
                                        alex_1core = alex(a,res,xdim,ydim,tdim),
                                        cpp_1core = cube_parallel(a,res, 1),
                                        cpp_2core = cube_parallel(a,res, 2),
                                        cpp_3core = cube_parallel(a,res, 3),
                                        cpp_cache = dim_to_cube(xdim,ydim,tdim))

There is a lot of overhead from redundant subsetting and computation in your current code. Clean that up if you want more speed:

  • With xdim <- ydim <- 20; tdim <- 5 I see a speedup of about 23% on my machine.
  • With xdim <- ydim <- 200; tdim <- 10 I see a speedup of about 25%.

This comes at the low cost of some additional memory, as is apparent from the code below.

    xdim <- ydim <- 20; tdim <- 5
    a <- array(0:1,dim=c(xdim,ydim,tdim))
    res <- array(0:1,dim=c(xdim,ydim,tdim))

    microbenchmark(op = {
      for (t in 1:tdim){
        for (x in 3:(xdim-2)){
          for (y in 3:(ydim-2)){
            res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
          }
        }
      }
    },
    alex = {
      for (t in 1:tdim){
        temp <- a[,,t]
        for (x in 3:(xdim-2)){
          temp2 <- temp[(x-2):(x+2),]
          for (y in 3:(ydim-2)){
            res[x,y,t] <- sum(temp2[,(y-2):(y+2)])
          }
        }
      }
    }, times = 50)

    Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
       op 4.855827 5.134845 5.474327 5.321681 5.626738 7.463923    50   b
     alex 3.720368 3.915756 4.213355 4.012120 4.348729 6.320481    50  a

Further improvements:

  • If you write this in C++, I suspect that exploiting the recursion res[x,y,t] = res[x,y-1,t] - sum(a[...,y-3,...]) + sum(a[...,y+2,...]) would save you extra time. In R, it did not in my timing tests.
  • This problem is also embarrassingly parallel. There is no reason you could not split over the t dimension to make more use of a multi-core architecture.

Both are left to the reader / OP.

