An elegant way to count, for each pair of columns in a matrix, the number of elements in one column that are larger than those in the other?

I currently have a solution that works. I am wondering if there is a more elegant approach?

First, the setup:

```r
set.seed(315)
mat <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4, ncol = 5)

> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    4    1    3    3
[2,]    5    3    5    1    4
[3,]    4    1    1    4    3
[4,]    3    3    1    1    1
```

From this matrix I want to create a 5x5 output matrix, where the entry at (i, j) is the number of elements in column j of the input matrix that are larger than the corresponding elements in column i.

edit: I originally described the output the other way around - with the (i, j) entry counting the elements of column i that are larger than those in column j - but my code actually produces the transpose of that. I have changed the description to match the output, and some of the differences between the answers may stem from this.
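To make the (i, j) convention concrete, the printed matrix can be entered by value and a single entry checked by hand (a sketch, not part of the original post):

```r
# The matrix printed above, entered column by column
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4, ncol = 5)

# Entry (i, j) counts elements of column j larger than those of column i,
# so the (3, 1) entry compares column 1 against column 3:
sum(mat[, 1] > mat[, 3])  # 3  (3>1, 5>5 fails, 4>1, 3>1)
```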

This solution gives the desired result:

```r
mat.pm <- apply(mat, MARGIN = 2, function(x) {
  return(apply(mat, MARGIN = 2, function(y) {
    return(sum(x > y))
  }))
})

> mat.pm
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    0    0    0
[2,]    2    0    1    1    2
[3,]    3    2    0    2    2
[4,]    2    3    1    0    1
[5,]    3    2    1    1    0
```

Is there a better way that doesn't involve nested apply calls?

edit: here are benchmarks for the different approaches:

```r
library(microbenchmark)

set.seed(315)
bm_data <- matrix(sample(1:5, 6000, replace = TRUE), nrow = 200, ncol = 30)

op <- microbenchmark(
  APPLY1 = apply(bm_data, MARGIN = 2, function(x) {
    return(apply(bm_data, MARGIN = 2, function(y) {
      return(sum(x > y))
    }))
  }),
  APPLY2 = apply(bm_data, 2, function(x) colSums(x > bm_data)),
  SWEEP = apply(bm_data, 2, function(x) colSums(sweep(bm_data, 1, x, "-") < 0)),
  VECTORIZE = {
    n <- 1:ncol(bm_data)
    ind <- expand.grid(n, n)
    out <- colSums(bm_data[, c(ind[, 2])] > bm_data[, c(ind[, 1])])
  },
  SAPPLY = sapply(seq(ncol(bm_data)), function(i) colSums(bm_data[, i] > bm_data)),
  times = 100L
)

> summary(op)
       expr      min        lq    median        uq       max neval
1    APPLY1 9742.091 10519.757 10923.896 11876.614 13006.850   100
2    APPLY2 1012.097  1080.926  1148.111  1247.965  3338.314   100
3     SWEEP 7020.979  7667.972  8580.420  8943.674 33601.336   100
4 VECTORIZE 3036.700  3266.815  3516.449  4476.769 28638.246   100
5    SAPPLY  978.812  1021.754  1078.461  1150.782  3303.798   100
```
Strategies

@Ricardo's SAPPLY and @Simon's APPLY2 are nice one-line solutions that run much faster than my APPLY1 approach. In terms of elegance, @Simon's APPLY2 hits the mark - simple, easy to read, and fast.

One thing I learned from the discussion here is how much faster the apply functions run over a matrix than over a data.frame: convert first, then calculate, whenever possible.

@Simon's expand.grid approach is the most creative - I would never have thought of attacking the problem that way. Nice.
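One more compact formulation worth noting - a sketch, not one of the posted answers - uses outer over the column indices. It is a one-liner, but no faster than SAPPLY/APPLY2, since Vectorize is just an mapply loop underneath:

```r
set.seed(315)
mat <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4, ncol = 5)

# res[i, j] = number of elements of column j exceeding column i
n <- seq_len(ncol(mat))
res <- outer(n, n, Vectorize(function(i, j) sum(mat[, j] > mat[, i])))
# res agrees with mat.pm from the double-apply version
```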

Score: +6

5 answers

Edit:

See below for more details, but you can do this in a single apply loop:

```r
apply(mat, 2, function(x) colSums(x > mat))
```

apply is fast here because it is optimized for working on matrices. Much of the time apply usually spends is in converting a data.frame to a matrix, which is not needed here.
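A quick sketch of that point (illustrative, not from the answer): if the data arrive as a data.frame, convert once up front and do all the work on the matrix.

```r
# Illustrative sketch: one explicit conversion, then pure matrix work.
df <- as.data.frame(matrix(sample(1:5, 6000, replace = TRUE), nrow = 200))
m  <- as.matrix(df)                              # convert once...
res <- apply(m, 2, function(x) colSums(x > m))   # ...then compute
dim(res)  # 30 x 30; the diagonal is all zero since sum(x > x) is 0
```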


Original:

This can be done in a completely vectorised way, because > has a method for matrices and data.frames. So you can generate all pairs of column indices with expand.grid(), use them to subset the matrix, do the logical comparison, then use colSums to get the result and matrix to wrap it back to the right size. All in 4 lines (really it could be 2):

```r
n <- 1:ncol(mat)
ind <- expand.grid(n, n)
out <- colSums(mat[, c(ind[, 1])] > mat[, c(ind[, 2])])
matrix(out, ncol(mat), byrow = TRUE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    0    0    0
# [2,]    2    0    1    1    2
# [3,]    3    2    0    2    2
# [4,]    2    3    1    0    1
# [5,]    3    2    1    1    0
```

Update:

apply turns out to be even faster, and combining apply with @Ricardo's whole-matrix comparison gets us down to a single-apply solution, the fastest (?) yet, about 4x quicker than the OP's:

```r
# Single apply loop
f1 <- function(mat) apply(mat, 2, function(x) colSums(x > mat))

# OP's double apply loop
f2 <- function(mat) {
  apply(mat, MARGIN = 2, function(x) {
    return(apply(mat, MARGIN = 2, function(y) {
      return(sum(x > y))
    }))
  })
}

require(microbenchmark)
microbenchmark(f1(mat), f2(mat))
# Unit: microseconds
#     expr     min       lq   median       uq      max neval
#  f1(mat)  95.190  97.6405 102.7145 111.4635  159.584   100
#  f2(mat) 361.862 370.7860 398.7830 418.3810 1336.506   100
```
Score: +5

I think the results you have are transposed:

```r
## This gives you what you show as output
sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat))

## This gives you what you _describe_ in the question
t(sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat)))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    2    3    2    3
# [2,]    1    0    2    3    2
# [3,]    0    1    0    1    1
# [4,]    0    1    2    0    1
# [5,]    0    2    2    1    0
```
Score: +4

The method here uses only one apply, but replaces the other with sweep, so I'm not sure whether it counts:

```r
apply(mat, 2, function(x) colSums(sweep(mat, 1, x, "-") < 0))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    0    0    0
# [2,]    2    0    1    1    2
# [3,]    3    2    0    2    2
# [4,]    2    3    1    0    1
# [5,]    3    2    1    1    0
```
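To see what sweep contributes: sweep(mat, 1, x, "-") subtracts the vector x from every column (MARGIN = 1 recycles x down the rows), so a negative entry marks a place where x beats that column. A tiny sketch (my own, using the first two columns of the question's matrix and x = column 3):

```r
m <- matrix(c(3, 5, 4, 3,    # column 1 of the question's matrix
              4, 3, 1, 3),   # column 2
            nrow = 4)
x <- c(1, 5, 1, 1)           # column 3

sweep(m, 1, x, "-")                # each column of m minus x
colSums(sweep(m, 1, x, "-") < 0)   # elements of x exceeding each column: 0 1
```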
Score: +1

Benchmarking:

```r
bigmat <- matrix(sample(0:5, 200, rep = TRUE), nr = 10)

gridfoo <- function(mat) {
  n <- 1:ncol(mat)
  ind <- expand.grid(n, n)
  out <- colSums(mat[, c(ind[, 1])] > mat[, c(ind[, 2])])
}
appfoo  <- function(mat) apply(mat, 2, function(x) colSums(sweep(mat, 1, x, "-") < 0))
app2foo <- function(mat) t(sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat)))

microbenchmark(gridfoo(bigmat), appfoo(bigmat), app2foo(bigmat), times = 10)
# Unit: microseconds
#             expr      min       lq    median       uq      max neval
#  gridfoo(bigmat)  363.909  369.895  381.4410  413.086  522.557    10
#   appfoo(bigmat) 1208.892 1231.129 1238.1850 1252.083 1521.913    10
#  app2foo(bigmat)  298.482  310.883  317.0835  323.284  762.454    10
```

But... (look at the difference once the matrix gets bigger)

```r
bigmat <- matrix(sample(0:5, 20000, rep = TRUE), nr = 100)
microbenchmark(gridfoo(bigmat), appfoo(bigmat), app2foo(bigmat), times = 10)
# Unit: milliseconds
#             expr       min        lq   median        uq       max neval
#  gridfoo(bigmat) 106.15115 112.98458 149.5746 183.87987 249.35418    10
#   appfoo(bigmat) 127.44553 127.92874 132.5372 136.42562 199.12123    10
#  app2foo(bigmat)  14.64483  14.99676  18.6089  20.51824  20.91122    10
```
Score: +1

This works, but may not scale well because of the expensive merge step.

```r
library(reshape2)

matmelted <- melt(mat)
matmeltedcross <- merge(matmelted, matmelted, by = "Var1", allow.cartesian = TRUE)
matmeltedcross$count <- matmeltedcross$value.x > matmeltedcross$value.y
mat.pm <- with(
  matmeltedcross[matmeltedcross$count == TRUE, ],
  table(Var2.y, Var2.x)
)
```

Output -

```r
> mat.pm
      Var2.x
Var2.y 1 2 3 4 5
     1 0 1 0 0 0
     2 2 0 1 1 2
     3 3 2 0 2 2
     4 2 3 1 0 1
     5 3 2 1 1 0
```
Score: 0
