Using sapply vs. for for efficient writes to pre-allocated data structures

Suppose I have a data structure that I want to pre-allocate and write into, for performance reasons, rather than growing it over time. First I tried doing this with sapply:

set.seed(1)
count <- 5
pre <- numeric(count)
sapply(1:count, function(i) {
  pre[i] <- rnorm(1)
})
pre
# [1] 0 0 0 0 0

for (i in 1:count) {
  pre[i] <- rnorm(1)
}
pre
# [1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884

I assume this is because the anonymous function passed to sapply runs in a different scope (or is it an environment in R?), and as a result the pre it sees is not the same object. The for loop runs in the same scope/environment and therefore works as expected.
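To make sure I understand the behaviour, here is a minimal standalone sketch (the function f and the value 99 are throwaway, just for illustration):

pre <- numeric(3)
f <- function(i) {
  pre[i] <- 99   # plain <- modifies a local copy of pre
  pre            # returns that local copy
}
f(1)
# [1] 99  0  0
pre
# [1] 0 0 0      # the global pre is untouched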

I generally try to use R's idiom of iterating with functions rather than for loops, but I don't see how to do that here. Is there something else I should be doing, or a better idiom for this type of operation?

As noted, my example is highly contrived; I'm not actually interested in generating normal deviates. Rather, my real code deals with a data frame of four columns and 1.5 million rows. Previously I relied on growing and merging to get the final data frame, and after benchmarking I decided to try to avoid the merges and pre-allocate instead.

+6
4 answers

sapply is not intended to be used this way; it already pre-allocates its result.

That said, the for loop is probably not the source of the slow performance; more likely it is that you are repeatedly modifying a data.frame inside the loop. For example:

set.seed(21)
N <- 1e4
d <- data.frame(n=1:N, s=sample(letters, N, TRUE))
l <- as.list(d)

set.seed(21)
system.time(for(i in 1:N) { d$n[i] <- rnorm(1); d$s <- sample(letters,1) })
#    user  system elapsed
#    6.12    0.00    6.17

set.seed(21)
system.time(for(i in 1:N) { l$n[i] <- rnorm(1); l$s <- sample(letters,1) })
#    user  system elapsed
#    0.14    0.00    0.14

D <- as.data.frame(l, stringsAsFactors=FALSE)
identical(d,D)
# [1] TRUE

So if you do loop, iterate over individual vectors (or a list of them) and only combine them into a data.frame after the loop.
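A minimal sketch of that pattern (the column names x and y, their types, and the sizes are assumed for illustration, not taken from the original code):

# Fill plain vectors inside the loop, then build the data.frame once at the end.
n <- 1e4
x <- numeric(n)
y <- character(n)
for (i in seq_len(n)) {
  x[i] <- rnorm(1)
  y[i] <- sample(letters, 1)
}
result <- data.frame(x = x, y = y, stringsAsFactors = FALSE)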

+7

The apply family is not intended for performing side effects such as changing the state of a variable. These functions are designed simply to return values, which you then assign to a variable. This is in keeping with the functional programming paradigm that R partly adheres to. When you use these functions as intended, pre-allocation does not matter much, and that is part of their appeal. You can easily do this without pre-allocating: p <- sapply(1:count, function(i) rnorm(1)). But this example is a bit artificial --- p <- rnorm(5) is what you would actually use.

If your actual problem is different from this and you have performance issues, look at vapply. It is just like sapply, but lets you specify the type of the result, which gives it a speed advantage. If that does not help, look at the data.table or ff packages.
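A minimal sketch of vapply applied to the contrived example (FUN.VALUE declares that each call returns a single numeric value):

# vapply is like sapply, but the type/shape of each result is declared up front.
set.seed(1)
count <- 5
p <- vapply(1:count, function(i) rnorm(1), FUN.VALUE = numeric(1))
p
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078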

+3

Yes, you are essentially modifying a copy of pre that is local to the anonymous function, which itself returns the result of its last evaluation (a vector of length 1). sapply() therefore returns the correct answer as a vector (because it collects the individual length-1 vectors), but it does not change pre in the global workspace.

You can get around this using the <<- operator:

set.seed(1)
count <- 5
pre <- numeric(count)
sapply(1:count, function(i) {
  pre[i] <<- rnorm(1)
})
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

That does modify pre, but I would not do this, for various reasons.

In this case, I don't think there is much to be gained from pre-allocating pre when using sapply().


Also, for this example both approaches are horribly inefficient; just have rnorm() generate count random numbers in one call. But I presume the example was only there to illustrate the point?

+2

I'm not sure what you are asking. The traditional idiom for sapply in this case would be

 pre <- sapply( 1:count, function(x) rnorm(1) ) 

You do not pre-allocate there, but then you are not bound to a pre-allocated variable either.

I suspect things would be much clearer if you posted the actual loop you want to change. You say you have performance problems, and you may well get an answer here that optimizes things considerably; there are several regulars who love that kind of problem.

It also sounds as if you have one long function or loop. The apply-family functions are mostly about expressiveness, and they help keep code understandable when you mix vectorized functions with operations that cannot be vectorized. A few small sapply calls mixed with vectorized functions are usually much faster than one big loop in R.
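As a rough sketch of that style (the data and the per-element step are made up purely for illustration):

# Vectorized work where possible; sapply only for the step that has no
# vectorized equivalent (here, an arbitrary per-element simulation).
x <- 1:10
y <- x^2 + 1                               # fully vectorized
z <- sapply(x, function(i) sum(rnorm(i)))  # per-element step, not vectorizable as-is
data.frame(x = x, y = y, z = z)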

+1

Source: https://habr.com/ru/post/928111/

