Effectively building GRanges / IRanges from an Rle vector

I have a coded path length vector representing a certain value at each position in the genome, in order. As an example of a toy, suppose I had only one chromosome 10 in length, then I would have a vector similar to

library(GenomicRanges) set.seed(1) toyData = Rle(sample(1:3,10,replace=TRUE)) 

I would like to force this to a GRanges object. The best I can come up with is

 gr = GRanges('toyChr',IRanges(cumsum(c(0,runLength(toyData)[-nrun(toyData)])), width=runLength(toyData)), toyData = runValue(toyData)) 

which works, but rather slow. Is there a faster way to build the same object?

+7
r bioinformatics bioconductor run-length-encoding iranges
source share
1 answer

As @TheUnfunCat noted, the OP solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost any combination of base R and could not beat the effectiveness of Rle from the S4Vectors package, so I resorted to Rcpp . Here is the main function:

 GenomeRcpp <- function(v) { x <- WhichDiffZero(v) m <- v[c(1L,x+1L)] s <- c(0L,x) e <- c(x,length(v))-1L GRanges('toyChr',IRanges(start = s, end = e), toyData = m) } 

WhichDiffZero is an Rcpp function that pretty much does the same as which(diff(v) != 0) in base R Great credit to @ G.Grothendieck .

 #include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] IntegerVector WhichDiffZero(IntegerVector x) { int nx = x.size()-1; std::vector<int> y; y.reserve(nx); for(int i = 0; i < nx; i++) { if (x[i] != x[i+1]) y.push_back(i+1); } return wrap(y); } 

Below are some guidelines:

 set.seed(437) testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1)))) microbenchmark(GenomeRcpp(testData), GenomeOrig(testData)) Unit: milliseconds expr min lq mean median uq max neval cld GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773 100 a GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727 100 a identical(GenomeRcpp(testData), GenomeOrig(testData)) [1] TRUE 

I have been working on this for the past few days, and I am definitely not satisfied. I hope someone takes what I did (as this is a different approach) and create something much better.

+3
source share

All Articles