Speed up the calculation of the median of each 3-column combination

If I have a data frame like this:

df = data.frame(matrix(rnorm(100), 5000, 100)) 

I can use the following to compute, row by row, the median of every combination of three columns:

 median_df = t(apply(df, 1, combn, 3, median)) 

The problem is that this takes several hours to complete. The culprit is median(), which takes about ten times longer than max() or min().

How can I speed this up, perhaps by using a faster version of median() or by working with the underlying data in a different way?

Update:

If I run the code above, but only on df[, 1:10], as follows:

 median_df = t(apply(df[,1:10], 1, combn, 3, median)) 

takes 29 seconds

 fastMedian_df = t(apply(df[,1:10], 1, combn, 3, fastMedian)) 

with fastMedian() from the ccaPP package takes 6.5 seconds, and

 max_df = t(apply(df[,1:10], 1, combn, 3, max)) 

takes 2.5 seconds

So we see a significant improvement with fastMedian(). Can we do better?
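
(For reference, a minimal sketch of how median() and fastMedian() can be compared head-to-head on a single vector. This assumes the ccaPP and microbenchmark packages are installed; it is an illustrative comparison only, not the timings above.)

 library(ccaPP)
 library(microbenchmark)
 x <- rnorm(1e4)
 # fastMedian() is ccaPP's C++ implementation of the sample median;
 # median() is the base R version
 microbenchmark(median(x), fastMedian(x))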

Tags: performance, function, r, dataframe, median
1 answer

One approach to speeding things up is to note that the median of three numbers is their sum minus their maximum minus their minimum. This means we can vectorize the median calculation by processing each triple of columns once (computing the medians for all rows in a single vectorized step) instead of processing each row separately.
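
A quick sanity check of that identity on an arbitrary made-up triple:

 x <- c(7, 2, 5)
 # median of three numbers = sum - max - min
 median(x) == sum(x) - max(x) - min(x)
 # [1] TRUE

The vectorized implementation and a benchmark against the original row-by-row approach: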

 set.seed(144)
 # Fully random matrix
 df = matrix(rnorm(50000), 5000, 10)
 original <- function(df) t(apply(df, 1, combn, 3, median))
 josilber <- function(df) {
   combos <- combn(seq_len(ncol(df)), 3)
   apply(combos, 2, function(x) rowSums(df[,x]) - pmin(df[,x[1]], df[,x[2]], df[,x[3]]) -
                                  pmax(df[,x[1]], df[,x[2]], df[,x[3]]))
 }
 system.time(res.josilber <- josilber(df))
 #   user  system elapsed 
 #  0.117   0.009   0.149 
 system.time(res.original <- original(df))
 #   user  system elapsed 
 # 15.107   1.864  16.960 
 all.equal(res.josilber, res.original)
 # [1] TRUE

Vectorization gives a 110x speedup with 10 columns and 5,000 rows. Unfortunately, I do not have a machine with enough memory to hold the 808.5 million numbers in the output of your full example.
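
(That figure is simply the number of three-column combinations times the number of rows:)

 choose(100, 3)         # 161700 three-column combinations
 choose(100, 3) * 5000  # 808,500,000 medians in the full output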

You can speed this up further with an Rcpp function that takes the vector representation of a matrix (the vector obtained by reading the matrix down its columns) along with a chunk size, and returns the median of each chunk. The function relies heavily on std::nth_element, which is asymptotically linear in the number of elements you take the median of. (Note that I do not average the two middle values when taking the median of an even-length chunk; I take the element that ends up at position chunkSize/2, which does not matter here since every chunk has length 3.)

 library(Rcpp)
 cppFunction(
 "NumericVector vectorizedMedian(NumericVector x, int chunkSize) {
    const int n = x.size() / chunkSize;
    std::vector<double> input = Rcpp::as<std::vector<double> >(x);
    NumericVector res(n);
    for (int i=0; i < n; ++i) {
      std::nth_element(input.begin()+i*chunkSize,
                       input.begin()+i*chunkSize+chunkSize/2,
                       input.begin()+(i+1)*chunkSize);
      res[i] = input[i*chunkSize+chunkSize/2];
    }
    return res;
 }")
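
A quick way to see what the helper does (toy values of my own, not from the benchmark): the input vector is processed in consecutive chunks of length chunkSize, and the median of each chunk is returned.

 # two chunks of length 3: (3, 1, 2) and (9, 7, 8)
 vectorizedMedian(c(3, 1, 2, 9, 7, 8), 3)
 # [1] 2 8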

Now we just call this function instead of using rowSums, pmin and pmax:

 josilber.rcpp <- function(df) {
   combos <- combn(seq_len(ncol(df)), 3)
   apply(combos, 2, function(x) vectorizedMedian(as.vector(t(df[,x])), 3))
 }
 system.time(josilber.rcpp(df))
 #   user  system elapsed 
 #  0.049   0.008   0.081 
 all.equal(josilber(df), josilber.rcpp(df))
 # [1] TRUE

In total this gives a 210x speedup: a 110x speedup from moving from a non-vectorized median to a vectorized one, and a further roughly 2x speedup from replacing the rowSums/pmin/pmax combination with the Rcpp-based vectorized median.
