Perform a rolling distribution test efficiently

I have a matrix in which each row is a sample from a distribution. I want to do a rolling comparison of distributions using ks.test and save the test statistic in each case. The conceptually simplest way to implement this is with a loop:

    set.seed(1942)
    mt <- rbind(rnorm(5), rnorm(5), rnorm(5), rnorm(5))
    results <- matrix(as.numeric(rep(NA, nrow(mt))))
    for (i in 2:nrow(mt)) {
      results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
    }

However, my real data has ~400 columns and ~300,000 rows for one example, and I have many examples, so I would like this to be fast. The Kolmogorov-Smirnov test is not mathematically complicated, so if the answer is "implement it in Rcpp," I would reluctantly accept that, but I would be somewhat surprised: computing the statistic for a single pair of rows is already very fast in R.

Approaches I have tried but could not get working: dplyr using rowwise/do/lag, zoo using rollapply (which is what I use to generate the distributions), and filling a data.table in a loop (edit: this one works, but it is still slow).

Tags: optimization, r, rollapply
4 answers

A quick-and-dirty implementation in Rcpp:

    // [[Rcpp::depends(RcppArmadillo)]]
    #include <RcppArmadillo.h>

    // Two-sample KS statistic for two equal-length samples x and y
    double KS(arma::colvec x, arma::colvec y) {
      int n = x.n_rows;
      arma::colvec w = join_cols(x, y);
      arma::uvec z = arma::sort_index(w);
      // +1 for points from x, -1 for points from y, in pooled sorted order
      w.fill(-1); w.elem( find(z <= n-1) ).ones();
      return max(abs(cumsum(w)))/n;
    }

    // [[Rcpp::export]]
    Rcpp::NumericVector K_S(arma::mat mt) {
      int n = mt.n_cols;
      Rcpp::NumericVector results(n);
      // compare each column with the previous one
      for (int i = 1; i < n; i++) {
        arma::colvec x = mt.col(i-1);
        arma::colvec y = mt.col(i);
        results[i] = KS(x, y);
      }
      return results;
    }
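To compile and call this from R, Rcpp::sourceCpp() handles compilation and export in one step; a minimal sketch, assuming the code above is saved as ks_roll.cpp (the file name is arbitrary):

    library(Rcpp)
    # Compiles the C++ above and exposes K_S() in the R session
    sourceCpp("ks_roll.cpp")

Note that K_S() walks columns rather than rows, which is why the calls below pass t(mt); Armadillo stores matrices column-major, so column access is contiguous.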

For a matrix of size (400, 30000), it completes within 1 s:

    system.time(K_S(t(mt)))[3]
    # elapsed
    #    0.98

And the result agrees with the ks.test loop:

    set.seed(1942)
    mt <- matrix(rnorm(400 * 30000), nrow = 30000)

    results <- rep(0, nrow(mt))
    for (i in 2:nrow(mt)) {
      results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
    }

    result <- K_S(t(mt))
    all.equal(result, results)
    # [1] TRUE

One source of speedup is writing a smaller version of ks.test that does less. ks.test2 below is more restrictive than ks.test: it assumes, for example, that you have no missing values and that you always want the statistic for a two-sided test.

    ks.test2 <- function(x, y){
      nx <- length(x)
      ny <- length(y)
      w  <- c(x, y)
      # running difference of the two ECDFs at the pooled sorted values
      z  <- cumsum(ifelse(order(w) <= nx, 1/nx, -1/ny))
      max(abs(z))
    }
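For reference, the quantity computed here (and in the Rcpp answer above) is the two-sample Kolmogorov-Smirnov distance

    D = \sup_t \lvert \hat{F}_x(t) - \hat{F}_y(t) \rvert

where \hat{F}_x and \hat{F}_y are the two empirical CDFs. Walking through the pooled values in sorted order, the running sum rises by 1/nx at each point from x and falls by 1/ny at each point from y, so after every step it equals \hat{F}_x - \hat{F}_y at that point; the largest absolute value over all steps is D.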

Make sure the output matches ks.test:

    set.seed(999)
    x <- rnorm(400)
    y <- rnorm(400)

    ks.test(x, y)$statistic
    #     D
    # 0.045

    ks.test2(x, y)
    # [1] 0.045

Now measure the savings from the smaller function:

    library(microbenchmark)
    microbenchmark(
      ks.test(x, y),
      ks.test2(x, y)
    )
    # Unit: microseconds
    #            expr      min       lq      mean   median        uq      max neval cld
    #   ks.test(x, y) 1030.238 1070.303 1347.3296 1227.207 1313.8490 6338.918   100   b
    #  ks.test2(x, y)  709.719  730.048  832.9532  833.861  888.5305 1281.284   100  a
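To apply ks.test2 to the rolling comparison in the question, a minimal wrapper could look like this (roll_ks is a hypothetical helper, not part of the original answer):

    # Rolling KS statistic over adjacent rows; NA in the first slot,
    # matching the loop in the question
    roll_ks <- function(mt) {
      c(NA, vapply(2:nrow(mt),
                   function(i) ks.test2(mt[i - 1, ], mt[i, ]),
                   numeric(1)))
    }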

I was able to compute the pairwise Kolmogorov-Smirnov statistics using ks.test() with rollapplyr().

    library(zoo)

    results <- rollapplyr(data = big, width = 2,
                          FUN = function(x) ks.test(x[1, ], x[2, ])$statistic,
                          by.column = FALSE)

This gets the expected result, but it is slow for a data set of your size. Painfully slow. That may be because ks.test() computes much more than just the statistic on each iteration; it also computes the p-value and does a fair amount of error checking.

Indeed, if we simulate a large dataset like this:

    big <- NULL
    for (i in 1:400) {
      big <- cbind(big, rnorm(300000))
    }

The rollapplyr() solution takes a very long time; I stopped execution after about 2 hours, by which point it had computed almost all (but not all) of the results.

So although rollapplyr() is likely faster than a for loop, it is unlikely to be the best overall solution in terms of performance.
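If one does stay with rollapplyr(), a possible mitigation is to swap in the slimmed-down ks.test2 from the answer above, dropping the p-value and error checking on every iteration; a sketch, assuming ks.test2 is defined:

    library(zoo)
    # Same rolling structure, statistic-only function in place of ks.test()
    results <- rollapplyr(data = big, width = 2,
                          FUN = function(x) ks.test2(x[1, ], x[2, ]),
                          by.column = FALSE)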


Here's a dplyr solution that gets the same result as your loop. I doubt it is actually faster than the loop, but perhaps it can serve as a first step toward a solution.

    require(dplyr)

    mt %>%
      as.data.frame %>%
      mutate_each(funs(lag)) %>%   # lagged copy of every column
      cbind(mt) %>%                # side by side with the original rows
      slice(-1) %>%                # drop the first row (nothing to compare it to)
      rowwise %>%
      do({
        x <- unlist(.)
        n <- length(x)
        data.frame(ks = ks.test(head(x, n / 2), tail(x, n / 2))$statistic)
      }) %>%
      unlist %>%
      c(NA, .) %>%                 # restore the leading NA, as in the loop
      matrix
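If the pipeline above is assigned to a variable, say res (a hypothetical name), it can be sanity-checked against the loop result from the question:

    # res: matrix returned by the pipeline above
    # results: output of the loop in the question
    all.equal(as.numeric(res), as.numeric(results))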
