Fast (er) way to index a matrix in R

Question

Fast (er) way to index a matrix in R

First of all, I'm looking for a quick (er) way to subset / index a matrix many times:

for (i in 1:99000) { subset.data <- data[index[, i], ] }

Background:
I am performing a sequential testing procedure, including loading in R. Wanting to replicate some of the simulation results, I came across this bottleneck where you need to do a lot of indexing. To implement the bootstrap block, I created an index matrix with which I will multiply the original data matrix to obtain repeated data samples.

 # The basic setup B <- 1000 # no. of bootstrap replications n <- 250 # no. of observations m <- 100 # no. of models/data series # Create index matrix with B columns and n rows. # Each column represents a resampling of the data. # (actually block resamples, but doesn't matter here). boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B) # Make matrix with m data series of length n. sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m) subsetMatrix <- function(data, index) { # fn definition for timing subset.data <- data[index, ] return(subset.data) } # check how long it takes. Rprof("subsetMatrix.out") for (i in 1:(m - 1)) { for (b in 1:B) { # B * (m - 1) = 1000 * 99 = 99000 boot.data <- subsetMatrix(sample.data, boot.index[, b]) # do some other stuff } # do some more stuff } Rprof() summaryRprof("subsetMatrix.out") # > summaryRprof("subsetMatrix.out") # $by.self # self.time self.pct total.time total.pct # subsetMatrix 9.96 100 9.96 100 # In the actual application: ######### # > summaryRprof("seq_testing.out") # $by.self # self.time self.pct total.time total.pct # subsetMatrix 6.78 53.98 6.78 53.98 # colMeans 1.98 15.76 2.20 17.52 # makeIndex 1.08 8.60 2.12 16.88 # makeStats 0.66 5.25 9.66 76.91 # runif 0.60 4.78 0.72 5.73 # apply 0.30 2.39 0.42 3.34 # is.data.frame 0.22 1.75 0.22 1.75 # ceiling 0.18 1.43 0.18 1.43 # aperm.default 0.14 1.11 0.14 1.11 # array 0.12 0.96 0.12 0.96 # estimateMCS 0.10 0.80 12.56 100.00 # as.vector 0.10 0.80 0.10 0.80 # matrix 0.08 0.64 0.08 0.64 # lapply 0.06 0.48 0.06 0.48 # / 0.04 0.32 0.04 0.32 # : 0.04 0.32 0.04 0.32 # rowSums 0.04 0.32 0.04 0.32 # - 0.02 0.16 0.02 0.16 # > 0.02 0.16 0.02 0.16 # # $by.total # total.time total.pct self.time self.pct # estimateMCS 12.56 100.00 0.10 0.80 # makeStats 9.66 76.91 0.66 5.25 # subsetMatrix 6.78 53.98 6.78 53.98 # colMeans 2.20 17.52 1.98 15.76 # makeIndex 2.12 16.88 1.08 8.60 # runif 0.72 5.73 0.60 4.78 # doTest 0.68 5.41 0.00 0.00 # apply 0.42 3.34 0.30 2.39 # aperm 0.26 2.07 0.00 0.00 # is.data.frame 0.22 1.75 0.22 1.75 # sweep 0.20 1.59 0.00 0.00 # ceiling 0.18 1.43 0.18 1.43 # aperm.default 0.14 1.11 0.14 1.11 # array 0.12 0.96 0.12 0.96 # as.vector 0.10 0.80 0.10 0.80 # matrix 0.08 0.64 0.08 0.64 # lapply 0.06 0.48 0.06 0.48 # unlist 0.06 0.48 0.00 0.00 # / 0.04 0.32 0.04 0.32 # : 0.04 0.32 0.04 0.32 # rowSums 0.04 0.32 0.04 0.32 # - 0.02 0.16 0.02 0.16 # > 0.02 0.16 0.02 0.16 # mean 0.02 0.16 0.00 0.00 # # $sample.interval # [1] 0.02 # # $sampling.time # [1] 12.56'

The sequential testing procedure takes about 10 seconds. Using this in simulations with 2500 repetitions and several parametric constellations, it will take about 40 days. Using parallel processing and better processor power, you can do it faster, but still not very nice: /

Is there a better way to reprogram the data / get rid of the loop?
Can be applied, inserted into vectors, replicated, etc. go anywhere?
Does it make sense to implement a subset in C (e.g. manipulate some pointers)?

Despite the fact that every step has already been taken incredibly fast R, it is simply not fast enough.
I would be very happy for any answer / help / advice!

related issues:
- A quick subset of matrices through '[': row by row, column by column, or doesn't it matter?
- quick function for creating bootstrap samples in matrix forms in R
- random sampling - matrix

from there

 mapply(function(row) return(sample.data[row,]), row = boot.index) replicate(B, apply(sample.data, 2, sample, replace = TRUE))

didn't really do it for me.

+7

r simulation matrix-indexing statistics-bootstrap

Niels Dec 08 '13 at 18:26

source share

1 answer

flodel · Accepted Answer · 2013-12-08T23:56:14+0000

I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:

 makeStats <- function(data, index) { data.mean <- colMeans(data) m <- nrow(data) n <- ncol(index) tabs <- lapply(1L:n, function(j)tabulate(index[, j], nbins = m)) weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index)) boot.data.mean <- t(data) %*% weights - data.mean return(list(data.mean = data.mean, boot.data.mean = boot.data.mean)) } makeIndex <- function(B, blocks){ n <- ncol(blocks) l <- nrow(blocks) z <- ceiling(n/l) start.points <- sample.int(n, z * B, replace = TRUE) index <- blocks[, start.points] keep <- c(rep(TRUE, n), rep(FALSE, z*l - n)) boot.index <- matrix(as.vector(index)[keep], nrow = n, ncol = B) return(boot.index) }

This reduced the computation time from 28 to 6 seconds on my machine. I am sure there are other parts of the code that can be improved (including my use of lapply / tabulate above.)

Fast (er) way to index a matrix in R

More articles: