Summing every four (or n) consecutive values in each row of a large matrix in R

I have a very large dataset (60K x 4K), and I am trying to sum every four consecutive values within each row. The following is an example dataset.

    set.seed(123)
    mat <- matrix(sample(0:1, 48, replace = TRUE), 4)

         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
    [1,]    0    1    1    1    0    1    1    0    1     1     0     0
    [2,]    1    0    0    1    0    1    1    0    1     0     0     0
    [3,]    0    1    1    0    0    1    1    1    0     0     0     0
    [4,]    1    1    0    1    1    1    1    1    0     0     0     0

Here is what I am trying to accomplish:

 mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3 

i.e. add every four values and output the sum.

 mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2 

Continue to the end of the row (column 12 here).

 mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12] 

Once the first row is done, do the same for the second row, for example:

    mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4]
    mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
    mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]

The result will have dimensions nrow x (ncol / 4).

The expected result will look like this:

         col1-col4 col5-8 col9-12
    row1         3      2       2
    row2         2      2       1
    row3         2      3       0
    row4         3      4       0

Similarly for row 3 and so on, up to the number of rows in the matrix. How can I loop over this efficiently?

4 answers

While Matthew's answer is really cool (+1, by the way), you can get a much faster solution (~100x) if you avoid apply and use the *Sums functions (in this case colSums) together with a bit of vector manipulation:

    funSums <- function(mat) {
      t.mat <- t(mat)                                  # rows become columns
      dim(t.mat) <- c(4, length(t.mat) / 4)            # wrap columns every four items (this is what we want to sum)
      t(matrix(colSums(t.mat), nrow = ncol(mat) / 4))  # sum our new 4 element columns, and reconstruct desired output format
    }

    set.seed(123)
    mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
    funSums(mat)

It produces the desired result:

         [,1] [,2] [,3]
    [1,]    3    2    2
    [2,]    2    2    1
    [3,]    2    3    0
    [4,]    3    4    0

Now let's build a larger matrix (4 x 150,000) and compare against the other answers:

    library(RcppRoll)        # provides roll_sum(), needed by funRcpp
    library(microbenchmark)

    set.seed(123)
    mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)

    funApply <- function(mat) {   # Matthew's solution
      apply(array(mat, dim = c(4, 4, ncol(mat) / 4)), MARGIN = c(1, 3), FUN = sum)
    }
    funRcpp <- function(mat) {    # David's solution
      roll_sum(mat, 4, by.column = FALSE)[, seq_len(ncol(mat) - 4 + 1) %% 4 == 1]
    }

    microbenchmark(times = 10,
      funSums(mat),
      funApply(mat),
      funRcpp(mat)
    )

It produces:

    Unit: milliseconds
              expr        min         lq     median       uq       max neval
      funSums(mat)   4.035823   4.079707   5.256517   7.5359  42.06529    10
     funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960    10
      funRcpp(mat)  18.481184  20.364885  38.595383 106.0277 132.93382    10

And check:

    all.equal(funSums(mat), funApply(mat))
    # [1] TRUE
    all.equal(funSums(mat), funRcpp(mat))
    # [1] TRUE

The key point is that the *Sums functions are fully "vectorized": all of the calculations are performed in C. With apply, you still have to do a bunch of work in R that is not strictly vectorized (in the sense of primitive C functions), so it is slower (but much more flexible).
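
As a small illustration of that difference, the sketch below computes the same column sums both ways on a made-up 3 x 4 matrix (the names `m`, `via_colSums` and `via_apply` exist only for this example). `colSums` does its looping in compiled code, while `apply` calls the R function `sum` once per column.

```r
# Toy matrix, invented for illustration (not taken from the question).
m <- matrix(1:12, nrow = 3)

via_colSums <- colSums(m)        # single call; the loop over columns runs in C
via_apply   <- apply(m, 2, sum)  # R-level loop: sum() is invoked once per column

# Both give the same column totals.
all.equal(via_colSums, via_apply)
```

On a matrix this small the timing difference is negligible; it only starts to matter when the number of columns (and hence R-level function calls) gets large.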

For this particular problem, it should be possible to go another 2-3 times faster, since about half the time is spent on the two transpositions, which are only there so that the dim change lines the data up the way colSums needs.
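
One possible way to skip the transpositions, sketched here as an assumption rather than something that was benchmarked above, is to add up strided column slices directly. The helper name `funStride` and its `n` argument are hypothetical; the example matrix is typed in by hand from the printout in the question so that it does not depend on the random seed.

```r
# Hypothetical helper: sum every n consecutive columns without any t() calls.
funStride <- function(mat, n = 4) {
  stopifnot(ncol(mat) %% n == 0)           # groups must divide the columns evenly
  starts <- seq(1, ncol(mat), by = n)      # first column of each group: 1, 5, 9, ...
  out <- mat[, starts, drop = FALSE]       # start with the first column of every group
  for (k in seq_len(n - 1)) {              # add the 2nd, 3rd, ..., nth column of every group
    out <- out + mat[, starts + k, drop = FALSE]
  }
  out
}

# Example matrix copied from the question's printed output.
mat <- matrix(c(0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
                1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
                0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
                1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0),
              nrow = 4, byrow = TRUE)

res <- funStride(mat)
res
```

The loop runs only n - 1 times regardless of matrix size, and each iteration is a vectorized matrix addition, so the R-level overhead stays small. Whether it actually beats funSums would need to be measured on the real data.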


Reshaping the matrix into a 3D array is one way:

    apply(array(mat, dim=c(4, 4, 3)), MARGIN=c(1,3), FUN=sum)
    #      [,1] [,2] [,3]
    # [1,]    3    2    2
    # [2,]    2    2    1
    # [3,]    2    3    0
    # [4,]    3    4    0
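
The dimensions above are hard-coded for a 4 x 12 matrix summed in groups of four. A minimal sketch of the same idea with the dimensions derived from the input might look like this; `funArray` and its `n` argument are hypothetical names, and the matrix is typed in from the question's printout.

```r
# Hypothetical generalization of the 3D-array approach to any row count and group size n.
funArray <- function(mat, n = 4) {
  stopifnot(ncol(mat) %% n == 0)
  # arr[i, j, k] holds mat[i, (k - 1) * n + j]: row i, j-th column of the k-th group
  arr <- array(mat, dim = c(nrow(mat), n, ncol(mat) / n))
  apply(arr, MARGIN = c(1, 3), FUN = sum)  # sum over j for every (row, group) pair
}

# Example matrix copied from the question's printed output.
mat <- matrix(c(0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
                1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
                0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
                1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0),
              nrow = 4, byrow = TRUE)

res_array <- funArray(mat)
res_array
```

This works because R stores matrices column by column, so re-dimensioning simply cuts the existing data into slabs of n columns without moving any values.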

Here's a different approach using the RcppRoll package:

    library(RcppRoll) # Uses C++/Rcpp
    n <- 4 # The summing range
    roll_sum(mat, n, by.column = FALSE)[, seq_len(ncol(mat) - n + 1) %% n == 1]
    ##      [,1] [,2] [,3]
    ## [1,]    3    2    2
    ## [2,]    2    2    1
    ## [3,]    2    3    0
    ## [4,]    3    4    0
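
The column filter at the end of that line is easy to misread, so here is a base-R sketch of what it selects for the 12-column example. The names `ncols`, `starts` and `keep` are made up for this illustration; no RcppRoll call is involved.

```r
n <- 4        # window width, as in the answer
ncols <- 12   # number of columns in the example matrix

starts <- seq_len(ncols - n + 1)  # a width-n window can start at columns 1..9
keep <- starts %% n == 1          # TRUE only where the window starts a new block

kept_starts <- which(keep)        # window start positions that survive the filter
kept_starts

# The same positions, written directly:
seq(1, ncols, by = n)
```

In other words, the rolling sum computes all overlapping windows and the filter then keeps only the non-overlapping ones. Note that the filter relies on n > 1, because `x %% 1` is always 0 and so never equals 1.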

This may be the slowest of all:

    set.seed(123)
    mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
    mat
    output <- sapply(seq(4, ncol(mat), 4), function(i) {
      apply(mat, 1, function(j) {
        sum(j[c(i - 3, i - 2, i - 1, i)], na.rm = TRUE)
      })
    })
    output
    #      [,1] [,2] [,3]
    # [1,]    3    2    2
    # [2,]    2    2    1
    # [3,]    2    3    0
    # [4,]    3    4    0

Explicit nested for-loops may be slower still, but this answer is pretty close to nested for-loops anyway, since both sapply and apply loop at the R level.
