Do large loops hang in R?

Suppose I want to perform a simulation using the following function:

fn1 <- function(N) {
  res <- c()
  for (i in 1:N) {
    x <- rnorm(2)
    res <- c(res, x[2] - x[1])
  }
  res
}

For very large N the calculation seems to freeze. Are there better ways to do this?

(Inspired by: https://stat.ethz.ch/pipermail/r-help/2008-February/155591.html )

+7
for-loop r
Jul 23 '09 at 4:15
5 answers

For loops in R are notoriously slow, but there is another problem here: it is much faster to preallocate the result vector, res, than to grow it by appending at each iteration.

Below we compare the speed of the version above with a version that starts with a res vector of length N and sets its i-th element inside the loop.

fn1 <- function(N) {
  res <- c()
  for (i in 1:N) {
    x <- rnorm(2)
    res <- c(res, x[2] - x[1])
  }
  res
}

fn2 <- function(N) {
  res <- rep(0, N)
  for (i in 1:N) {
    x <- rnorm(2)
    res[i] <- x[2] - x[1]
  }
  res
}

> N <- 50000
> system.time(res1 <- fn1(N))
   user  system elapsed
  6.568   0.256   6.826
> system.time(res2 <- fn2(N))
   user  system elapsed
  0.452   0.004   0.496

Also, as Sharpie points out, we can make this a little faster still by using R's apply functions (or their siblings, sapply and lapply).

fn3 <- function(N) {
  sapply(1:N, function(i) {
    x <- rnorm(2)
    return(x[2] - x[1])
  })
}

> system.time(res3 <- fn3(N))
   user  system elapsed
  0.397   0.004   0.397
+2
Jul 23 '09 at 4:19

Loop efficiency in R can be significantly increased by using the apply family of functions, which essentially process entire vectors of data at once rather than looping through them. For the loop shown above, two main operations occur at each iteration:

# A vector of two random numbers is generated
x <- rnorm(2)

# The difference between those numbers is calculated
x[2] - x[1]

In this case, the appropriate function is sapply(). sapply() operates on a list of objects, such as the vector generated by the 1:N statement in the loop, and returns a vector of results:

sapply(1:N, function(i) {
  x <- rnorm(2)
  return(x[2] - x[1])
})

Note that the index value i is available during the function call, taking the values 1 through N in turn; in this case, however, it is not actually needed.
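Since the index is not needed, one tidier but equivalent way to write the same thing (not part of the original answer, just a sketch) is replicate(), which is itself a thin wrapper around sapply():

# replicate() repeats the expression N times and collects the results
# into a vector, exactly like the sapply() call above.
res <- replicate(N, {
  x <- rnorm(2)
  x[2] - x[1]
})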

Getting into the habit of recognizing where apply can be used instead of for is a very valuable skill: many R libraries for parallel computing provide drop-in parallelism through the apply functions. Using apply can therefore give access to significant performance gains on multi-core systems with zero code refactoring.
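As a rough illustration of that last point, here is one way the sapply() version might be parallelized with the parallel package that ships with modern R (fn3_par and the cores argument are my own names, not from the original answer). On Windows, mclapply() cannot fork, so parLapply() with a cluster would be the usual substitute.

# A minimal sketch: mclapply() is a near drop-in replacement for lapply()
# on Unix-alikes, spreading the iterations over several cores.
library(parallel)

fn3_par <- function(N, cores = 2) {
  res <- mclapply(1:N, function(i) {
    x <- rnorm(2)
    x[2] - x[1]
  }, mc.cores = cores)
  unlist(res)  # mclapply() returns a list, so flatten it to a vector
}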

+9
Jul 23 '09 at 4:27

Turning my comment on chris_dubois's answer into an answer here, some timing information:

> system.time(res <- rnorm(50000) - rnorm(50000))
   user  system elapsed
   0.06    0.00    0.06

Contrast this with fn3 from the same answer:

> system.time(res3 <- fn3(50000))
   user  system elapsed
   1.33    0.01    1.36

The first thing to notice is that my laptop is slower than chris_dubois's machine. :)

The second, and more important, point is that the vectorized approach, which is entirely applicable here, is an order of magnitude faster. (Richie Cotton also noted this in a comment on the same answer.)

This brings me to my final point: it is a myth that apply and its friends are much faster than for loops in R. They are of the same order in most measurements I have seen, because underneath they are just for loops. See also this post:

http://yusung.blogspot.com/2008/04/speed-issue-in-r-computing-apply-vs.html

According to Professor Brian Ripley, "apply() is just a wrapper for a loop." The only advantage of using apply() is that it makes your code tidier!

Exactly. You should use apply when it is more expressive, especially if you are programming in a functional style, not because it is faster.
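To make the "wrapper for a loop" point concrete, here is a toy re-implementation (my own sketch, not R's actual source) showing that an sapply-like function for numeric results can be written as nothing more than a preallocated for loop, which is why the two approaches end up in the same performance ballpark:

# A toy sapply for numeric results: just a preallocated for loop underneath.
my_sapply <- function(X, FUN) {
  out <- vector("numeric", length(X))
  for (j in seq_along(X)) {
    out[j] <- FUN(X[[j]])
  }
  out
}

# Used the same way as fn3 above:
# res <- my_sapply(1:N, function(i) { x <- rnorm(2); x[2] - x[1] })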

+4
Jul 26 '09 at 4:34

Sometimes a loop is not needed at all. Since rnorm gives an iid sample (theoretically), you will achieve the same result (a sample of X - Y where X and Y are both N(0,1)) by doing:

res <- rnorm(N) - rnorm(N)
+2
Jul 24 '09 at 7:31

Perhaps the most efficient replacement for your function would be:

fn <- function(N) rnorm(N, 0, sqrt(2))

which is about twice as fast as taking the difference of two normal variates, since X - Y with X, Y ~ N(0,1) is itself N(0, 2), so a single rnorm call with sd = sqrt(2) gives the same distribution. More generally, if your goal is to run simple simulations, preallocating vectors/arrays and calling vectorized built-in functions will speed things up considerably.
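A quick sanity check of that distributional claim (my own sketch; the value of N here is arbitrary and chosen only for the check):

# Both samples should have variance close to 2, since Var(X - Y) = 1 + 1 = 2.
set.seed(1)                # for reproducibility
N <- 100000
a <- rnorm(N) - rnorm(N)   # difference of two standard normals
b <- rnorm(N, 0, sqrt(2))  # single draw with sd = sqrt(2)
var(a)
var(b)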

If you want to run Monte Carlo simulations for statistical estimation (e.g. MCMC), R has a number of native packages. For general stochastic simulation I don't know of specific R packages, but you could try SimPy (http://simpy.sourceforge.net/), which is great.

0
Jul 27 '09 at 15:00


