I currently have a solution that works. I am wondering if there is a more elegant approach?
First, the setup:

    set.seed(315)
    mat <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4, ncol = 5)

    > mat
         [,1] [,2] [,3] [,4] [,5]
    [1,]    3    4    1    3    3
    [2,]    5    3    5    1    4
    [3,]    4    1    1    4    3
    [4,]    3    3    1    1    1

From this matrix I want to create a 5x5 output matrix, where entry i, j is the number of elements in column j that are greater than the corresponding elements in column i of the input matrix.
Edit: I originally described the i, j entry of the output as the number of elements in column i that are greater than column j, but the output shows the opposite relation. I changed the description to match the output, and any differences between the answers received are probably a result of this.
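To make the definition concrete, here is a minimal sketch that computes a single entry of the output. The matrix is typed in by hand from the printout above, since `sample()` may not reproduce the same values under a different R version; the names `i`, `j`, and `entry_12` are only illustrative.

```r
# Input matrix copied from the printout above (filled column by column).
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4, ncol = 5)

i <- 1
j <- 2

# Entry (i, j): count the rows where column j exceeds column i.
entry_12 <- sum(mat[, j] > mat[, i])
entry_12  # 1: only the first row (4 > 3) satisfies the comparison
```

This agrees with the value in row 1, column 2 of the desired output.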
This solution gives the desired result:
    mat.pm <- apply(mat, MARGIN = 2, function(x) {
      return(apply(mat, MARGIN = 2, function(y) {
        return(sum(x > y))
      }))
    })

    > mat.pm
         [,1] [,2] [,3] [,4] [,5]
    [1,]    0    1    0    0    0
    [2,]    2    0    1    1    2
    [3,]    3    2    0    2    2
    [4,]    2    3    1    0    1
    [5,]    3    2    1    1    0
Is there a better way that doesn't involve two nested apply functions?
Edit: here are some benchmarks for the different approaches:
    library(microbenchmark)
    set.seed(315)
    bm_data <- matrix(sample(1:5, 6000, replace = TRUE), nrow = 200, ncol = 30)

    op <- microbenchmark(
      APPLY1 = apply(bm_data, MARGIN = 2, function(x) {
        return(apply(bm_data, MARGIN = 2, function(y) {
          return(sum(x > y))
        }))
      }),
      APPLY2 = apply(bm_data, 2, function(x) colSums(x > bm_data)),
      SWEEP = apply(bm_data, 2, function(x) colSums(sweep(bm_data, 1, x, "-") < 0)),
      VECTORIZE = {
        n <- 1:ncol(bm_data)
        ind <- expand.grid(n, n)
        out <- colSums(bm_data[, c(ind[, 2])] > bm_data[, c(ind[, 1])])
      },
      SAPPLY = sapply(seq(ncol(bm_data)), function(i) colSums(bm_data[, i] > bm_data)),
      times = 100L
    )

    > summary(op)
           expr      min        lq    median        uq       max neval
    1    APPLY1 9742.091 10519.757 10923.896 11876.614 13006.850   100
    2    APPLY2 1012.097  1080.926  1148.111  1247.965  3338.314   100
    3     SWEEP 7020.979  7667.972  8580.420  8943.674 33601.336   100
    4 VECTORIZE 3036.700  3266.815  3516.449  4476.769 28638.246   100
    5    SAPPLY  978.812  1021.754  1078.461  1150.782  3303.798   100
Strategies
@Ricardo's SAPPLY and @Simon's APPLY2 are both nice one-line solutions that run much faster than my APPLY1 approach. In terms of elegance, @Simon's update with APPLY2 hits the mark: simple, easy to read, and fast.
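For reference, a small sketch that checks both one-liners against the nested version on the 4x5 example. The matrix is entered by hand from the printout earlier in the post, and the names `nested`, `apply2`, and `sapply1` are only illustrative.

```r
# Example matrix entered by hand (filled column by column).
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4, ncol = 5)

# Original approach: two nested apply() calls.
nested <- apply(mat, 2, function(x) apply(mat, 2, function(y) sum(x > y)))

# APPLY2: x is recycled down every column of mat, so colSums() counts,
# for each column of mat, the rows where x is larger.
apply2 <- apply(mat, 2, function(x) colSums(x > mat))

# SAPPLY: the same idea, iterating over column indices instead of columns.
sapply1 <- sapply(seq(ncol(mat)), function(i) colSums(mat[, i] > mat))

# all.equal() is used rather than identical() because sum() returns
# integers while colSums() returns doubles.
stopifnot(isTRUE(all.equal(nested, apply2)),
          isTRUE(all.equal(nested, sapply1)))
```

Both one-liners rely on R recycling the length-4 vector across the columns of the matrix, which is what replaces the inner apply.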
One takeaway from the discussion here is how much faster the apply functions run on a matrix than on a data.frame. Convert first and then compute, if possible.
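A minimal sketch of the convert-then-compute idea, assuming the data arrives as a data.frame. The names `df`, `m`, and `res` are illustrative, and the data.frame is built from the small example matrix only to keep the sketch self-contained.

```r
# Example matrix entered by hand (filled column by column).
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4, ncol = 5)

# Stand-in for data that starts out as a data.frame (columns V1..V5).
df <- as.data.frame(mat)

# Convert once, up front, so the repeated comparisons work on a matrix.
m <- as.matrix(df)

# Same APPLY2-style computation as on a plain matrix.
res <- apply(m, 2, function(x) colSums(x > m))

# The result carries the column names V1..V5 as dimnames;
# unname() drops them if a bare matrix is wanted.
res_plain <- unname(res)
```

How much time the conversion saves will depend on the data, so it is worth benchmarking on the real input.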
@Simon's expand.grid approach is the most creative; I hadn't even thought of approaching the problem this way. Nice.
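For completeness, a sketch of how the expand.grid idea plays out on the small example. The benchmark version stops at a flat vector, so the final reshaping step with `matrix()` is my own addition, and the names `ind`, `out`, and `res` are illustrative.

```r
# Example matrix entered by hand (filled column by column).
mat <- matrix(c(3, 5, 4, 3,
                4, 3, 1, 3,
                1, 5, 1, 1,
                3, 1, 4, 1,
                3, 4, 3, 1), nrow = 4, ncol = 5)

n <- seq_len(ncol(mat))

# All (i, j) column pairs; the first variable (i) varies fastest.
ind <- expand.grid(i = n, j = n)

# Build two 4 x 25 matrices of paired columns and compare them in one shot.
out <- colSums(mat[, ind$j] > mat[, ind$i])

# Because i varies fastest, filling column by column puts pair (i, j)
# at position [i, j] of the result.
res <- matrix(out, nrow = ncol(mat), ncol = ncol(mat))
```

The trade-off is memory: the two intermediate matrices have ncol(mat)^2 columns each, which may explain why this approach lands in the middle of the benchmark rather than at the top.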