Aggregate rows in a large matrix by rowname

I would like to combine the rows of a matrix by summing the values of rows that share the same rowname. My current approach is as follows:

    > M
      a b c d
    1 1 1 2 0
    1 2 3 4 2
    2 3 0 1 2
    3 4 2 5 2
    > index <- as.numeric(rownames(M))
    > M <- cbind(M,index)
    > Dfmat <- data.frame(M)
    > Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
    > M <- as.matrix(Dfmat)
    > rownames(M) <- M[,"index"]
    > M <- subset(M, select= -index)
    > M
      a b c d
    1 3 4 6 2
    2 3 0 1 2
    3 4 2 5 2

The problem with this approach is that I need to apply it to a series of very large matrices (up to 1000 rows and 30,000 columns). In these cases the computation time is very long (the same problem occurs when using ddply). Is there a more efficient solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know, they are stored in a sparse matrix format.

+7
3 answers

Here's a solution using by and colSums, but it requires some fiddling because of the default output of by.

    M <- matrix(1:9,3)
    rownames(M) <- c(1,1,2)
    t(sapply(by(M,rownames(M),colSums),identity))
      V1 V2 V3
    1  3  9 15
    2  3  6  9
+6

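James's answer works as expected, but it is rather slow for large matrices. Here is a version that avoids creating new objects. Note that it combines rows with max; to sum instead, replace the apply(m[i.start:i.end,], 2, max) call with colSums(m[i.start:i.end,]):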

    combineByRow <- function(m) {
        m <- m[ order(rownames(m)), ]

        ## keep track of previous row name
        prev <- rownames(m)[1]
        i.start <- 1
        i.end <- 1

        ## cache the rownames -- profiling shows that it takes
        ## forever to look at them
        m.rownames <- rownames(m)
        stopifnot(all(!is.na(m.rownames)))

        ## go through matrix in a loop, as we need to combine some unknown
        ## set of rows
        for (i in 2:(1+nrow(m))) {
            curr <- m.rownames[i]

            ## if we found a new row name (or are at the end of the matrix),
            ## combine all rows and mark invalid rows
            if (prev != curr || is.na(curr)) {
                if (i.start < i.end) {
                    m[i.start,] <- apply(m[i.start:i.end,], 2, max)
                    m.rownames[(1+i.start):i.end] <- NA
                }
                prev <- curr
                i.start <- i
            } else {
                i.end <- i
            }
        }

        m[ which(!is.na(m.rownames)),]
    }

Testing shows that it is about 10 times faster than the answer using by (2 versus 20 seconds in this example):

    N <- 10000
    m <- matrix( runif(N*100), nrow=N)
    rownames(m) <- sample(1:(N/2),N,replace=T)

    start <- proc.time()
    m1 <- combineByRow(m)
    print(proc.time()-start)

    start <- proc.time()
    m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
    print(proc.time()-start)

    all(m1 == m2)
+1

The Matrix.utils package has an aggregate function, aggregate.Matrix. It can accomplish what you want with one line of code, and it is about 10 times faster than combineByRow and 100 times faster than by:

    library(Matrix.utils)
    library(microbenchmark)

    N <- 10000
    m <- matrix( runif(N*100), nrow=N)
    rownames(m) <- sample(1:(N/2),N,replace=T)

    > microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
    Unit: milliseconds
                                                      expr        min         lq       mean     median         uq        max neval
     a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642    10
                                      b <- combineByRow(m)  634.96542  689.54724  759.87833  732.37424  866.22673  923.15491    10
                    c <- aggregate.Matrix(m, row.names(m))   42.26674   44.60195   53.62292   48.59943   67.40071   70.40842    10
    > identical(as.vector(a),as.vector(c))
    [1] TRUE
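Since the question mentions that the inputs start out in a sparse format, the same per-group sum can also be written as a product with a sparse group-indicator matrix. The following is only a minimal sketch, assuming the Matrix package is installed. The toy matrix and the names G and agg are made up for illustration, and this is not claimed to be how aggregate.Matrix is implemented.

```r
library(Matrix)

# Small sparse example; the row names mark the groups to be summed
m <- Matrix(c(1, 0, 2,
              0, 3, 0,
              4, 0, 5), nrow = 3, byrow = TRUE, sparse = TRUE)
rownames(m) <- c("1", "1", "2")

# Indicator matrix: one row per distinct rowname, one column per row of m
G <- fac2sparse(rownames(m))

# Each row of the product is the sum of the rows of m in that group
agg <- G %*% m
agg
```

The result keeps one row per distinct rowname, so nothing has to be converted to a dense data.frame along the way.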

EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to use one of these other functions only if you were working with a Matrix, especially a sparse one, or if you were performing an aggregation other than sum.
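For plain sums on an ordinary dense matrix, base R's rowsum does this grouping directly. A minimal sketch on the example matrix from the question (the variable name res is just for illustration):

```r
# Example matrix from the question, with the rowname "1" appearing twice
M <- matrix(c(1, 1, 2, 0,
              2, 3, 4, 2,
              3, 0, 1, 2,
              4, 2, 5, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("1", "1", "2", "3"), c("a", "b", "c", "d")))

# rowsum() adds up the rows that share the same group label
res <- rowsum(M, group = rownames(M))
res
```

The two rows named "1" collapse into a single row holding their column-wise sums, and the other rows are returned unchanged.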

+1
