Looping through rows in a data frame

Suppose for some reason I need to loop through rows in a data frame.

I create a simple data.frame:

df <- data.frame(id = sample(1e6, 1e7, replace = TRUE)) 

It seems that f2 is much slower than f1, while I expected them to be equivalent.

f1 <- function(v){
  for (obs in 1:(1e6) ){
    a <- v[obs]
  }
  a
}
system.time(f1(df$id))

f2 <- function(){
  for (obs in 1:(1e6) ){
    a <- df$id[obs]
  }
  a
}
system.time(f2())

Do you know why? Do they use exactly the same amount of memory?

2 answers

If you write your timings like this instead, and keep in mind that df$x is really a function call (to `$`(df,x) ), the mystery disappears:

system.time(for(i in 1:1e6) df$x)
#    user  system elapsed
#    8.52    0.00    8.53
system.time(for(i in 1) df$x)
#    user  system elapsed
#       0       0       0
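
To make the "function call" point concrete, here is a minimal sketch on a toy data frame (the three-row df below is an assumption for illustration, not the data from the question). It only shows that the operator form and the explicit call form return the same thing:

```r
# Toy data frame, much smaller than the one in the question (illustrative only).
df <- data.frame(id = c(10L, 20L, 30L))

# `$` is an ordinary R function, so both lines perform the same extraction.
a <- df$id
b <- `$`(df, id)

# Check that the two forms agree.
same <- identical(a, b)
print(same)
```

Because the extraction is a call, writing df$id inside a loop body repeats that call on every iteration.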

In f1 you completely circumvent the data frame by simply passing the vector to your function. So your code is essentially "I have a vector! This is the first element. This is the second element. This is the third ..."

In f2, by contrast, you give it the whole data frame and then extract one element of a single column on every iteration. So your code is: "I have a data frame. This is the first element of the id column. This is the second element of the id column. This is the third ..."

It is much faster to extract the simple data structure (the vector) once and then work only with that, rather than repeatedly extracting the simple structure from the larger object.
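
As a rough sketch of that advice, the variant below keeps the shape of f2 but pulls the column out before the loop. The name f3 is hypothetical, and the data frame is deliberately small so the example runs instantly; no timings are claimed here.

```r
# Small stand-in for the data frame in the question (sizes reduced on purpose).
df <- data.frame(id = sample(100, 1000, replace = TRUE))

# Hypothetical f3: like f2, but the column is extracted once, outside the loop.
f3 <- function() {
  v <- df$id                   # single call to `$`
  a <- NULL
  for (obs in seq_along(v)) {
    a <- v[obs]                # plain vector indexing on each iteration
  }
  a
}

result <- f3()
```

With this change the loop body does the same work as f1, so wrapping the call in system.time() on the full-size data should show the two behaving alike, though the exact numbers will depend on the machine.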
