Subsetting a large matrix is slow

I have a numeric vector of length 5,000,000:

    > head(coordvec)
    [1] 47286545 47286546 47286547 47286548 47286549 47286550

and a 1,400,000 x 3 numeric matrix:

    > head(subscores)
            V1       V2     V3
    1 47286730 47286725  0.830
    2 47286740 47286791  0.065
    3 47286750 47286806 -0.165
    4 47288371 47288427  0.760
    5 47288841 47288890  0.285
    6 47288896 47288945  0.225

What I'm trying to accomplish: for each number in coordvec, find the average of V3 over the rows of subscores whose V1-V2 interval spans that number. To do this, I use the following approach:

    results <- numeric(length(coordvec))
    for (i in seq_along(coordvec)) {
      select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
      scores_subset <- subscores[select_rows, 3]
      results[i] <- mean(scores_subset)
    }

This is very slow and takes several days to complete. Is there a faster way?

Thanks,

Dan

3 answers

I think there are two difficult parts to this question. The first is finding the overlaps. I would use the IRanges package from Bioconductor (?findInterval in base R might also be useful).
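As an aside, here is a minimal sketch of the ?findInterval idea (my own illustration, not part of the IRanges solution below). It counts the spanning intervals in two vectorized passes but does not average V3, and it assumes V1 < V2 in every row:

    ## number of intervals spanning each coordinate p:
    ## (# of V1 strictly below p) - (# of V2 at or below p);
    ## left.open = TRUE makes findInterval count elements strictly below p
    nspan <- findInterval(coordvec, sort(subscores[, 1]), left.open = TRUE) -
        findInterval(coordvec, sort(subscores[, 2]))

Returning to the IRanges approach: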

    library(IRanges)

Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores; I sort the coordinate vector for convenience, assuming duplicate coordinates can be treated the same way:

    coord <- sort(sample(.Machine$integer.max, 5000000))
    starts <- sample(.Machine$integer.max, 1200000)
    scores <- runif(length(starts))
    q <- IRanges(coord, width=1)
    s <- IRanges(starts, starts + 100L)

Here we find which query overlaps which subject:

    system.time({
        olaps <- findOverlaps(q, s)
    })

This takes about 7 seconds on my laptop. There are different types of overlaps (see ?findOverlaps), so this step might need some refinement. The result is a pair of parallel vectors indexing the overlapping query and subject elements:

    > olaps
    Hits of length 281909
    queryLength: 5000000
    subjectLength: 1200000
      queryHits subjectHits
      <integer>   <integer>
    1        19      685913
    2        35      929424
    3        46     1130191
    4        52       37417

I think this is the end of the first difficult part, finding the 281909 overlaps. (I don't think the data.table answer suggested elsewhere deals with this step, though I could be wrong...)
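As one example of the refinement mentioned above (my illustration, not part of the original timing), the type= argument of findOverlaps controls what counts as an overlap:

    ## require each width-1 query to fall entirely within a subject range;
    ## other choices ("any", "start", "end", "equal") are in ?findOverlaps
    olaps_within <- findOverlaps(q, s, type="within")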

The next difficult part is calculating a large number of means. The built-in approach would be something like

    olaps0 <- head(olaps, 10000)
    system.time({
        res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
    })

which takes about 3.25 s on my computer and appears to scale linearly, so maybe 90 seconds for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so

    require(data.table)
    dt <- data.table(coord=start(q)[queryHits(olaps)],
                     score=scores[subjectHits(olaps)])
    res1 <- dt[, mean(score), by=coord]$V1

which takes about 2.5 seconds for all 280k overlaps.

Some additional speed can be gained by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable that marks the end of each run:

    idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
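To see what idx does, a tiny made-up example (hypothetical hits, not from the real data):

    ## query hits 1,1,2,2,2,5: idx is TRUE at the last position of each run
    qh <- c(1L, 1L, 2L, 2L, 2L, 5L)
    c(qh[-1] != qh[-length(qh)], TRUE)
    ## [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE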

We then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of each run:

    scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
    n <- diff(c(0L, seq_along(idx)[idx]))
    xt <- diff(c(0L, scoreHits))

And finally, the means are

    res2 <- xt / n
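Continuing the made-up example from above (repeated here so it stands alone, with hypothetical scores), the trick end to end:

    qh <- c(1L, 1L, 2L, 2L, 2L, 5L)
    idx <- c(qh[-1] != qh[-length(qh)], TRUE)
    sc <- c(0.2, 0.4, 0.1, 0.1, 0.4, 0.5)
    scoreHits <- cumsum(sc)[idx]           # cumulative sums at run ends: 0.6 1.2 1.7
    n <- diff(c(0L, seq_along(idx)[idx]))  # run lengths: 2 3 1
    xt <- diff(c(0L, scoreHits))           # per-run sums: 0.6 0.6 0.5
    xt / n                                 # per-run means: 0.3 0.2 0.5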

This takes about 0.6 s for the full data, and the result is identical to (though more cryptic than?) the data.table result:

    > identical(res1, res2)
    [1] TRUE

The original coordinates corresponding to the means are

    start(q)[queryHits(olaps)[idx]]
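For reference, a sketch of the same pipeline applied to the question's own objects. This assumes the subscores columns are (start, end, score) with start <= end, and note that findOverlaps treats ranges as closed, so endpoints count as spanning, unlike the strict < and > in the original loop:

    library(IRanges)
    library(data.table)
    ## width-1 queries for the coordinates, ranges for the score intervals
    q <- IRanges(coordvec, width = 1)
    s <- IRanges(subscores[, 1], subscores[, 2])
    olaps <- findOverlaps(q, s)
    ## mean V3 per overlapped coordinate
    dt <- data.table(coord = start(q)[queryHits(olaps)],
                     score = subscores[subjectHits(olaps), 3])
    res <- dt[, mean(score), by = coord]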

Something like this might be faster:

    require(data.table)
    subscores <- as.data.table(subscores)
    subscores[, cond := V1 < coordvec & V2 > coordvec]
    subscores[list(cond)[[1]], mean(V3)]

list(cond)[[1]] is needed because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated within calling scope." Source: ?data.table
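An equivalent workaround, if I'm reading that help page correctly, is to wrap the name in parentheses so it is treated as an expression and looked up as a column:

    ## (cond) is an expression, so it is evaluated within the data.table
    subscores[(cond), mean(V3)]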


Since your example is not easily reproducible, and even if it were, none of the example subscores rows spans any of the example coordinates, I'm not sure this is exactly what you are looking for, but you could use one of the apply family with a function:

    myfun <- function(x) {
      y <- subscores[, 1] < x & subscores[, 2] > x
      mean(subscores[y, 3])
    }

    sapply(coordvec, myfun)

You could also look into mclapply. If you have enough memory, that will probably speed things up significantly. The foreach package may give similar results. You have the for loop done "correctly", assigning into a preallocated results rather than growing it, but you are simply doing a lot of comparisons; it will be hard to speed this up much.
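A minimal sketch of the mclapply route (the core count is illustrative, and mclapply relies on forking, so it does not parallelize on Windows):

    library(parallel)
    ## same computation as sapply(coordvec, myfun), spread across
    ## 4 forked worker processes
    results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))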

