Adding consecutive four / n numbers to a large matrix in R

Question


I have a very large dataset (60K x 4K). I am trying to add up every four consecutive values within each row. The following is an example dataset.

    set.seed(123)
    mat <- matrix(sample(0:1, 48, replace = TRUE), 4)

         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
    [1,]    0    1    1    1    0    1    1    0    1     1     0     0
    [2,]    1    0    0    1    0    1    1    0    1     0     0     0
    [3,]    0    1    1    0    0    1    1    1    0     0     0     0
    [4,]    1    1    0    1    1    1    1    1    0     0     0     0

Here is what I am trying to accomplish:

 mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3

i.e. add every four values and output the sum. Then the next four:

 mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2

Continue to the end of the row (up to column 12 here).

 mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]

Once the first line is completed, apply it to the second line, for example:

    mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4]
    mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
    mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]

The result will be nrow x (ncol / 4).

The expected result will look like this:

         col1-col4 col5-8 col9-12
    row1         3      2       2
    row2         2      2       1
    row3         2      3       0
    row4         3      4       0

Similarly for row 3, and so on for every row in the matrix. How can I loop over this efficiently?

+7

loops r large-data

SHRram Aug 27 '14 at 18:45


4 answers

Reshaping the matrix into a 3D array is one way:

    apply(array(mat, dim=c(4, 4, 3)), MARGIN=c(1,3), FUN=sum)
    #      [,1] [,2] [,3]
    # [1,]    3    2    2
    # [2,]    2    2    1
    # [3,]    2    3    0
    # [4,]    3    4    0
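The dimensions in the call above are hard-coded for the 4 x 12 example. As a rough sketch of how the same idea might be generalized, the helper below (the name `block_sums_array` and its argument `n` are made up for illustration) derives the array dimensions from the input. It relies on R storing matrices column by column, so each slab `[, , k]` of the array holds `n` adjacent columns. The example matrix is typed in from the question's printout rather than regenerated with `set.seed`, since `sample()` can give different values in different R versions.

```r
# Hypothetical helper (not from the answer): sum every n consecutive
# columns within each row by reshaping into a 3D array.
block_sums_array <- function(mat, n) {
  stopifnot(ncol(mat) %% n == 0)  # assumes ncol is a multiple of n
  # Column-major storage: slab [, , k] holds columns (k-1)*n + 1 .. k*n
  arr <- array(mat, dim = c(nrow(mat), n, ncol(mat) / n))
  apply(arr, MARGIN = c(1, 3), FUN = sum)
}

# The question's example matrix, entered by hand from the printed output
mat <- matrix(c(0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
                1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
                0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
                1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0),
              nrow = 4, byrow = TRUE)

# Expected result, copied from the question
expected <- matrix(c(3, 2, 2,
                     2, 2, 1,
                     2, 3, 0,
                     3, 4, 0),
                   nrow = 4, byrow = TRUE)

res_array <- block_sums_array(mat, 4)
print(res_array)
```
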

+8

Matthew plourde Aug 27 '14 at 18:54


Here's a different approach using the RcppRoll package:

    library(RcppRoll) # Uses C++/Rcpp
    n <- 4 # The summing range
    roll_sum(mat, n, by.column = FALSE)[, seq_len(ncol(mat) - n + 1) %% n == 1]
    ##      [,1] [,2] [,3]
    ## [1,]    3    2    2
    ## [2,]    2    2    1
    ## [3,]    2    3    0
    ## [4,]    3    4    0

+5

David Arenburg Aug 27 '14 at 20:46


This may be the slowest of all:

    set.seed(123)
    mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
    mat

    output <- sapply(seq(4, ncol(mat), 4), function(i) {
      apply(mat, 1, function(j) {
        sum(j[c(i-3, i-2, i-1, i)], na.rm = TRUE)
      })
    })
    output

         [,1] [,2] [,3]
    [1,]    3    2    2
    [2,]    2    2    1
    [3,]    2    3    0
    [4,]    3    4    0

Nested for-loops may be slower, but this answer is pretty close to nested for-loops.

+1

Mark miller Aug 27 '14 at 22:45


Brodieg · Accepted Answer · 2014-08-27T21:28:39+0000

While Matthew's answer is really cool (+1, by the way), you can get a much faster solution (~100x) if you avoid apply and use the *Sums functions (in this case colSums) together with a bit of vector manipulation:

    funSums <- function(mat) {
      t.mat <- t(mat)                               # rows become columns
      dim(t.mat) <- c(4, length(t.mat) / 4)         # wrap columns every four items (this is what we want to sum)
      t(matrix(colSums(t.mat), nrow=ncol(mat) / 4)) # sum our new 4 element columns, and reconstruct desired output format
    }
    set.seed(123)
    mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
    funSums(mat)

It produces the desired result:

         [,1] [,2] [,3]
    [1,]    3    2    2
    [2,]    2    2    1
    [3,]    2    3    0
    [4,]    3    4    0
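To see why transposing and then reassigning `dim` groups the right values, it may help to trace the steps on a tiny matrix whose entries are easy to follow. This is only an illustrative walk-through; the matrix `m` and the intermediate names are made up for the example.

```r
# Illustrative 2 x 8 matrix: row 1 holds the odd numbers, row 2 the even ones
m <- matrix(1:16, nrow = 2)

# Step 1: transpose, so each row of m becomes a contiguous column in memory
t.m <- t(m)                          # 8 x 2

# Step 2: re-wrap the same data into columns of length 4;
# each new column is one run of four consecutive values from a row of m
dim(t.m) <- c(4, length(t.m) / 4)    # 4 x 4

# Step 3: one colSums call adds up every group of four
block_totals <- colSums(t.m)

# Step 4: the totals come out row by row (all groups of row 1, then row 2),
# so fill a matrix with one column per original row and transpose it back
result <- t(matrix(block_totals, nrow = ncol(m) / 4))
print(result)
```

Here the first row of `result` holds the two group totals for row 1 of `m`, and the second row holds those for row 2.
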

Now let's make something larger and compare against the other options:

    set.seed(123)
    mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)

    funApply <- function(mat) { # Matthew Solution
      apply(array(mat, dim=c(4, 4, ncol(mat) / 4)), MARGIN=c(1,3), FUN=sum)
    }
    funRcpp <- function(mat) {  # David Solution
      roll_sum(mat, 4, by.column = FALSE)[, seq_len(ncol(mat) - 4 + 1) %% 4 == 1]
    }

    library(RcppRoll)       # needed for roll_sum in funRcpp
    library(microbenchmark)
    microbenchmark(times=10,
      funSums(mat),
      funApply(mat),
      funRcpp(mat)
    )

It produces:

    Unit: milliseconds
              expr        min         lq     median       uq       max neval
      funSums(mat)   4.035823   4.079707   5.256517   7.5359  42.06529    10
     funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960    10
      funRcpp(mat)  18.481184  20.364885  38.595383 106.0277 132.93382    10

And check:

    all.equal(funSums(mat), funApply(mat))
    # [1] TRUE
    all.equal(funSums(mat), funRcpp(mat))
    # [1] TRUE

The key point is that the *Sums functions are fully "vectorized", in the sense that all the calculations happen in C. apply, by contrast, still has to do a bunch of work in R that is not strictly vectorized (in the primitive C function sense), so it is slower (but much more flexible).

Specifically for this problem, it might be possible to make this 2-3 times faster, since about half the time is spent on the transpositions, which are only necessary so that the dim changes do what I need for colSums to work.
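One possible way to skip the transpositions, offered only as a sketch and not as the approach this answer had in mind, is to add strided column slices directly: take columns 1, 5, 9, ... then add columns 2, 6, 10, ... and so on. The function name `sum_col_blocks` is hypothetical, and it was not benchmarked here, so whether it actually reaches the speedup mentioned above is an open question.

```r
# Hypothetical alternative (not from the answer): add strided column
# slices so that no transpose is needed.
sum_col_blocks <- function(mat, n = 4) {
  stopifnot(ncol(mat) %% n == 0)            # assumes ncol is a multiple of n
  starts <- seq(1, ncol(mat), by = n)       # first column of each block
  out <- mat[, starts, drop = FALSE]        # drop = FALSE keeps a matrix
  for (k in seq_len(n - 1)) {
    # add the (k+1)-th column of every block in one vectorized step
    out <- out + mat[, starts + k, drop = FALSE]
  }
  out
}

# The question's example matrix, entered by hand from the printed output
mat_small <- matrix(c(0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
                      1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
                      0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
                      1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0),
                    nrow = 4, byrow = TRUE)

res_strided <- sum_col_blocks(mat_small, 4)
print(res_strided)
```

The loop runs only `n - 1` times regardless of matrix size, and each iteration is a whole-matrix addition, so the per-element work stays in compiled code.
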
