As @TheUnfunCat noted, the OP solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost any combination of base R and could not beat the effectiveness of Rle from the S4Vectors package, so I resorted to Rcpp . Here is the main function:
GenomeRcpp <- function(v) { x <- WhichDiffZero(v) m <- v[c(1L,x+1L)] s <- c(0L,x) e <- c(x,length(v))-1L GRanges('toyChr',IRanges(start = s, end = e), toyData = m) }
WhichDiffZero is an Rcpp function that pretty much does the same as which(diff(v) != 0) in base R Great credit to @ G.Grothendieck .
#include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] IntegerVector WhichDiffZero(IntegerVector x) { int nx = x.size()-1; std::vector<int> y; y.reserve(nx); for(int i = 0; i < nx; i++) { if (x[i] != x[i+1]) y.push_back(i+1); } return wrap(y); }
Below are some guidelines:
set.seed(437) testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1)))) microbenchmark(GenomeRcpp(testData), GenomeOrig(testData)) Unit: milliseconds expr min lq mean median uq max neval cld GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773 100 a GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727 100 a identical(GenomeRcpp(testData), GenomeOrig(testData)) [1] TRUE
I have been working on this for the past few days, and I am definitely not satisfied. I hope someone takes what I did (as this is a different approach) and create something much better.
Joseph Wood
source share