How to replace a for-loop with vectorization when it runs several thousand times per data.frame row?

Question


Being still quite wet behind the ears with respect to R and, more importantly, vectorization, I can't figure out how to speed up the code below.

The for-loop calculates the number of seeds falling onto a road for several road segments with different densities of seed plants, using a random probability for every seed. As my real data frame has ~200,000 rows and seed numbers reach up to 300,000 per segment, the approach in the example below would take several hours on my current machine.

    #Example data.frame
    df <- data.frame(Density=c(0,0,0,3,0,120,300,120,0,0))
    #Example SeedRain vector
    SeedRainDists <- c(7.72,-43.11,16.80,-9.04,1.22,0.70,16.48,75.06,42.64,-5.50)
    #Calculating the number of seeds from plant densities
    df$Seeds <- df$Density * 500
    #Applying a probability of reaching the road for every seed
    df$SeedsOnRoad <- apply(as.matrix(df$Seeds),1,function(x){
        SeedsOut <- 0
        if(x>0){
            #Summing up the number of seeds reaching a certain distance
            for(i in 1:x){
                SeedsOut <- SeedsOut + ifelse(sample(SeedRainDists,1,replace=T)>40,1,0)
            }
        }
        return(SeedsOut)
    })

If someone could give me a hint as to how the loop could be replaced by vectorization, or how the data could be organized better to improve performance, I would really appreciate it!

Edit: Roland's answer showed that I may have oversimplified the question. In the for-loop, I draw a random value from a distribution of distances recorded by another author (which is why I cannot provide the data here). I have added an illustrative vector with plausible values for the SeedRain distances.

+4

performance vectorization for-loop r

sir_husefugg Mar 08 '13 at 17:26


2 answers

This should perform more or less the same simulation:

    df$SeedsOnRoad2 <- sapply(df$Seeds,function(x){
        rbinom(1,x,0.6)
    })

    #   Density  Seeds SeedsOnRoad SeedsOnRoad2
    #1        0      0           0            0
    #2        0      0           0            0
    #3        0      0           0            0
    #4        3   1500         892          877
    #5        0      0           0            0
    #6      120  60000       36048        36158
    #7      300 150000       90031        89875
    #8      120  60000       35985        35773
    #9        0      0           0            0
    #10       0      0           0            0
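
With the empirical distance vector from the edited question, the same binomial idea can be sketched without `sapply()`. Counting how many of n independent draws (with replacement) exceed 40 is a binomial draw whose success probability is the share of values in `SeedRainDists` above 40. The names `p_road` and `SeedsOnRoadBinom` are made up for this illustration, and the sketch assumes the draws are independent, as they are in the question's loop.

```r
# Example data from the question
df <- data.frame(Density = c(0, 0, 0, 3, 0, 120, 300, 120, 0, 0))
SeedRainDists <- c(7.72, -43.11, 16.80, -9.04, 1.22, 0.70, 16.48, 75.06, 42.64, -5.50)
df$Seeds <- df$Density * 500

# Empirical probability that a randomly drawn distance exceeds 40
# (p_road is a hypothetical helper name)
p_road <- mean(SeedRainDists > 40)

set.seed(1)
# rbinom() is vectorized over 'size', so a single call covers every row
df$SeedsOnRoadBinom <- rbinom(nrow(df), size = df$Seeds, prob = p_road)
df
```

The counts will not match the loop draw for draw, because a different sequence of random numbers is consumed, but they follow the same distribution and no vector of individual seeds is ever allocated.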

+5

Roland Mar 08 '13 at 17:41


Gavin Simpson · Accepted Answer · 2013-03-08T18:22:40+0000

One option is to generate the sample() for all Seeds of each row of df in a single call.

Using set.seed(1) before your loop-based code, I get:

    > df
       Density  Seeds SeedsOnRoad
    1        0      0           0
    2        0      0           0
    3        0      0           0
    4        3   1500         289
    5        0      0           0
    6      120  60000       12044
    7      300 150000       29984
    8      120  60000       12079
    9        0      0           0
    10       0      0           0

I get the same answer in a fraction of the time if I do this:

    set.seed(1)
    tmp <- sapply(df$Seeds, function(x) sum(sample(SeedRainDists, x, replace = TRUE) > 40))

    > tmp
     [1]     0     0     0   289     0 12044 29984 12079     0     0

For comparison:

    df <- transform(df, GavSeedsOnRoad = tmp)
    df

    > df
       Density  Seeds SeedsOnRoad GavSeedsOnRoad
    1        0      0           0              0
    2        0      0           0              0
    3        0      0           0              0
    4        3   1500         289            289
    5        0      0           0              0
    6      120  60000       12044          12044
    7      300 150000       29984          29984
    8      120  60000       12079          12079
    9        0      0           0              0
    10       0      0           0              0
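
To get a feel for the "fraction of the time" claim, a rough timing sketch for a single row might look like the following. The helper names `loop_count` and `vec_count` and the seed count `n` are hypothetical, and the timings will vary by machine, so no figures are quoted here.

```r
SeedRainDists <- c(7.72, -43.11, 16.80, -9.04, 1.22, 0.70, 16.48, 75.06, 42.64, -5.50)
n <- 20000  # hypothetical number of seeds for one row

# One sample() call per seed, as in the question's loop
loop_count <- function(n) {
  out <- 0
  for (i in seq_len(n)) {
    out <- out + (sample(SeedRainDists, 1, replace = TRUE) > 40)
  }
  out
}

# One sample() call for all seeds of the row
vec_count <- function(n) {
  sum(sample(SeedRainDists, n, replace = TRUE) > 40)
}

set.seed(1)
t_loop <- system.time(res_loop <- loop_count(n))
set.seed(1)
t_vec <- system.time(res_vec <- vec_count(n))

# Both versions consume the same random stream, so the counts should agree
res_loop == res_vec

# Elapsed seconds for each version
c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]])
```

Resetting the seed before each version is what makes the two counts comparable; without it they would only agree in distribution.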

The following points should be noted here:

Try not to call a function repeatedly in a loop if the function is vectorized or can generate the entire end result in a single call. Your code called sample() Seeds times for each row of df, each call returning a single sample from SeedRainDists. I instead make one sample() call per row of df, asking for a sample of size Seeds, so I call sample() 10 times whereas your code called it 271500 times.
Even if you do have to call a function repeatedly in a loop, move out of the loop anything vectorized that could be done on the entire result once the loop has finished. An example here is your accumulation of SeedsOut, which calls +() a large number of times.
It would have been better to collect each SeedsOut in a vector and then sum() that vector outside the loop. For example:
```
# Pre-allocate one slot per seed, fill it in the loop, sum once at the end
SeedsOut <- numeric(length = x)
for(i in seq_len(x)) {
    SeedsOut[i] <- ifelse(sample(SeedRainDists,1,replace=TRUE)>40,1,0)
}
sum(SeedsOut)
```
Note that R treats a logical value as if it were a numeric 0 or 1 wherever it is used in a mathematical function. Hence
```
 sum(ifelse(sample(SeedRainDists, 100, replace=TRUE)>40,1,0)) 
```
and
```
 sum(sample(SeedRainDists, 100, replace=TRUE)>40) 
```
will give the same result if run with the same set.seed().

There may be a fancier way of doing the sampling that requires fewer calls to sample() (and there is: sample(SeedRainDists, sum(Seeds), replace = TRUE) > 40, but then you need to take care of selecting the right elements of that vector for each row of df, which is not hard, just a little cumbersome), but what I show may be efficient enough?
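
The single-call variant mentioned above could be sketched as follows. This is one possible way to do the bookkeeping, not necessarily what the answer had in mind: `row_id` and `SeedsOnRoadOneCall` are hypothetical names, and `tabulate()` is used here to count the hits per row.

```r
# Example data from the question
df <- data.frame(Density = c(0, 0, 0, 3, 0, 120, 300, 120, 0, 0))
SeedRainDists <- c(7.72, -43.11, 16.80, -9.04, 1.22, 0.70, 16.48, 75.06, 42.64, -5.50)
df$Seeds <- df$Density * 500

set.seed(1)
# One sample() call for every seed of every row
hits <- sample(SeedRainDists, sum(df$Seeds), replace = TRUE) > 40

# Label each draw with the row it belongs to; rows with 0 seeds get no labels
row_id <- rep(seq_len(nrow(df)), times = df$Seeds)

# Count the hits per row; nbins keeps the zero-seed rows in the result
df$SeedsOnRoadOneCall <- tabulate(row_id[hits], nbins = nrow(df))
df
```

Note that this allocates vectors of length sum(Seeds). For the example that is 271500 elements, but with roughly 200,000 rows and up to 300,000 seeds per segment it may not fit in memory, in which case processing the rows in chunks, or staying with the per-row sapply() call, is likely the safer choice.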
