Does the order of for loops matter for speed in R?

I want to run a simulation study in which the simulation depends on two variables, x and y. x and y are vectors of candidate values that I want to evaluate (in all combinations) in my modeling study. In addition, for each combination of x and y I want several repetitions, since there is a stochastic term involved and each run for a given x and y will differ.

To show what I mean, here is a simplified example:

    x = 1:10
    y = 11:20
    iterations = 2000

    solution = array(NA, c(length(x), 3, iterations))
    for (i in x) {
      for (j in y) {
        for (k in 1:iterations) {
          z = rnorm(3) + c(i, j, 1)   # stochastic term plus the (x, y) inputs
          solution[i, , k] = z
        }
      }
    }

In my real problem, however, the code evaluated inside the loops is far less trivial; only the structure of the inputs and outputs is the same.

So what I would like to know, using the above example, is whether it is most efficient to arrange the loops in this order, or whether it would be better to make k in 1:iterations the outermost loop and use something like outer() inside it, since I am evaluating the function (z in this example) over the grid of x and y values.

In addition, I am very open to a completely different setup and design. At the end of the day, I want a solution indexed by x and y and averaged over all iterations, i.e. apply(solution, c(1, 2), mean).
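To make the alternative concrete, something like the following is what I have in mind (just a rough sketch of the toy example, with the grid of combinations built up front):

    # Rough sketch: replications outermost, the whole x/y grid evaluated inside
    grid = expand.grid(x = 1:10, y = 11:20)   # all combinations of x and y
    iterations = 2000

    solution = array(NA, c(nrow(grid), 3, iterations))
    for (k in 1:iterations) {
      for (g in 1:nrow(grid)) {
        solution[g, , k] = rnorm(3) + c(grid$x[g], grid$y[g], 1)
      }
    }

    # one averaged row per (x, y) combination
    avg = apply(solution, c(1, 2), mean)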


Edit:

As suggested, here is the actual code I use:

    library(survival)

    iter = 2000
    n = 120
    base = 2      # treatment:control allocation ratio
    hr = 0.5      # target hazard ratio
    mx = 3        # median survival, control
    my = mx/hr    # median survival, treatment

    ANS = NULL
    for (vecX in c(0.3, 0.5, 0.6, 0.7)) {
      out = NULL
      for (vecY in c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4,
                     0.5, 0.6, 0.7, 0.8, 0.9, 0.95)) {
        mxp = mx/vecX
        myp = my/vecX
        mxn = mx
        myn = my
        nt = round(n*base/(base + 1))
        nc = n - nt
        for (ii in 1:iter) {
          # split each arm into dx-positive ("P") and dx-negative ("N") subjects
          ntp = rbinom(1, nt, vecY)
          ntn = nt - ntp
          ncp = rbinom(1, nc, vecY)
          ncn = nc - ncp
          # exponential survival times with the group-specific medians
          S = c(rexp(ntp, log(2)/myp), rexp(ntn, log(2)/myn),
                rexp(ncp, log(2)/mxp), rexp(ncn, log(2)/mxn))
          data1 = data.frame(Group = c(rep("T", nt), rep("C", nc)),
                             dx = c(rep("P", ntp), rep("N", ntn),
                                    rep("P", ncp), rep("N", ncn)),
                             S)
          fit = survfit(Surv(data1$S) ~ data1$Group)
          coxfit = coxph(Surv(data1$S) ~ data1$Group)
          HR = exp(coxfit$coefficients)
          p.val = summary(coxfit)$logtest["pvalue"]
          out = rbind(out, c(vecX, vecY, ntp, ntn, ncp, ncn, HR, p.val))
        }
      }
      colnames(out) = c("vecX", "vecY", "ntp", "ntn", "ncp", "ncn", "HR", "p.val")
      ans = as.data.frame(out)
      ANS = rbind(ANS, ans)
    }
2 answers

Yes, I believe that in theory this should matter (see the example below).

R stores matrices in column-major order, like Fortran (and unlike C), so to minimize cache misses you want to traverse down the columns. The best approach to filling a matrix is therefore to put the column index in the outer loop.
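You can see the column-major storage directly (a quick check, not part of the original point):

    m <- matrix(1:6, nrow = 2)
    as.vector(m)    # 1 2 3 4 5 6: consecutive values run down the columns
    m[3]            # 3 -- linear indexing follows storage order, so this is m[1, 2]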

For n-dimensional arrays you would want to keep this in mind as well. For n = 3, I believe this means the layer (third index) should be the outermost loop, then the column, then the row. I could be wrong here, though.
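For what it's worth, a sketch of that ordering for a 3-dimensional array (my reading of the above, with the same caveat):

    arr <- array(NA, c(100, 100, 100))
    for (k in 1:dim(arr)[3]) {      # layer: slowest-varying index, outermost
      for (j in 1:dim(arr)[2]) {    # column
        for (i in 1:dim(arr)[1]) {  # row: fastest-varying index, innermost
          arr[i, j, k] <- 0
        }
      }
    }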

I ran this quick example with 5000-by-5000 matrices, and we see a difference of about 50 seconds, with fill_matrix2() being the faster one.

    n <- 5000
    A <- matrix(NA, n, n)
    B <- matrix(NA, n, n)

    # row index in the outer loop: jumps across columns on every inner step
    fill_matrix1 <- function(X, val) {
      for (i in 1:nrow(X)) {
        for (j in 1:ncol(X)) {
          X[i, j] <- val
        }
      }
      return(X)
    }

    # column index in the outer loop: walks down each column in storage order
    fill_matrix2 <- function(X, val) {
      for (j in 1:ncol(X)) {
        for (i in 1:nrow(X)) {
          X[i, j] <- val
        }
      }
      return(X)
    }

    system.time(fill_matrix1(A, 0))
    system.time(fill_matrix2(B, 0))

The order of the loops is practically irrelevant here. If you profile your code (see help("Rprof")), you will see that the processor time is spent in functions such as survfit and coxph, and, of course, in growing out, which should be avoided. Preallocate out at its final size and fill it in, rather than growing it row by row.
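For illustration, a minimal sketch of what preallocating could look like here, collapsing out and ANS into one matrix (the counter name r is mine; the grid sizes are read off the posted code):

    iter <- 2000
    nX <- 4                        # length of the vecX grid
    nY <- 14                       # length of the vecY grid
    out <- matrix(NA_real_, nrow = nX * nY * iter, ncol = 8)
    colnames(out) <- c("vecX", "vecY", "ntp", "ntn", "ncp", "ncn", "HR", "p.val")
    r <- 0
    # ...then inside the innermost loop, instead of rbind():
    #   r <- r + 1
    #   out[r, ] <- c(vecX, vecY, ntp, ntn, ncp, ncn, HR, p.val)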

