I think there are two challenging parts to this question. The first is finding the overlaps. I would use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
making a width-1 range representing the coordinate vector, and a set of ranges representing the estimates; I sort the coordinate vector for convenience, assuming duplicate coordinates can be treated the same
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))

q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
    olaps <- findOverlaps(q, s)
})
This takes about 7 seconds on my laptop. There are different types of overlaps (see ?findOverlaps), so this step may need a little refinement. The result is a pair of vectors indexing the query and the overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
    queryHits subjectHits
    <integer>   <integer>
  1        19      685913
  2        35      929424
  3        46     1130191
  4        52       37417
  ...
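The refinement mentioned above is not needed for this simulated data, but as a rough illustration, findOverlaps() takes arguments such as maxgap= and type= (these are real arguments; the particular values below are arbitrary)

## illustration only: also count near-misses up to 10 positions away, or
## require the query to share its start with the subject range
olapsNear <- findOverlaps(q, s, maxgap=10L)
olapsStart <- findOverlaps(q, s, type="start")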
I think this is the end of the first challenging part, finding the 281909 overlaps. (I don't think the data.table answer suggested elsewhere addresses this step, though I may be wrong...)
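For comparison, the base-R findInterval() route mentioned at the top might look roughly like the sketch below; since it reports at most one (the nearest) candidate range per coordinate, it is not a drop-in replacement when subject ranges overlap one another (my own sketch, not part of the timings)

## rough sketch: index of the largest sorted start <= each coordinate
o <- order(starts)
i <- findInterval(coord, starts[o])
i[i == 0L] <- NA_integer_                  # coordinates before the first start
hasHit <- !is.na(i) & coord <= starts[o][i] + 100L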
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
    res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25 s on my computer and appears to scale linearly, so maybe 90 s for the 280k overlaps. But I think we can do this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so
require(data.table)
dt <- data.table(coord=start(q)[queryHits(olaps)],
                 score=scores[subjectHits(olaps)])
res1 <- dt[, mean(score), by=coord]$V1
which takes about 2.5 s for all 280k overlaps.
Some further speed could be gained by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable that marks the end of each run.
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the end of each run, the length of each run, and the difference between the cumulative score at the end and at the beginning of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6 s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
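For what it's worth, the run logic is easier to see on a toy example (my own illustration, with made-up hits and scores)

qh <- c(1L, 1L, 3L, 3L, 3L, 7L)             # ordered query hits: three runs
sc <- c(0.2, 0.4, 1.0, 2.0, 3.0, 5.0)       # one score per overlap
isEnd <- c(qh[-1] != qh[-length(qh)], TRUE) # TRUE at the end of each run
diff(c(0, cumsum(sc)[isEnd])) /             # per-run score totals ...
    diff(c(0L, seq_along(isEnd)[isEnd]))    # ... divided by run lengths
## 0.3 2.0 5.0, the same values as tapply(sc, qh, mean)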
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
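and if a single object is wanted at the end, the coordinates and means could be combined, e.g. (my own wrap-up, not part of the timings)

ans <- data.frame(coord=start(q)[queryHits(olaps)[idx]], mean=res2)
head(ans)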