Iterate through a data frame efficiently in R, where each iteration depends on the previous element

Question

I have a data frame with two vectors of length 5 and a variable:

    x <- seq(1:5)
    y <- rep(0, 5)
    df <- data.frame(x, y)
    z <- 10

I need to iterate through the data frame and update y based on a condition involving x and z, and I need to update z at each iteration. Using a for loop, I would do the following:

    for (i in seq(2, nrow(df))) {
      if (df$x[i] %% 2 == 0) {
        df$y[i] <- df$y[i-1] + z
        z <- z - df$x[i]
      } else {
        df$y[i] <- df$y[i-1]
      }
    }

Using data frames is slow, and accessing the ith element with df$x[i] is inefficient, but I'm not sure how to vectorize this, since both y and z change at each iteration.

Does anyone have recommendations on the best way to iterate over this? I considered dropping data frames entirely and just using vectors to simplify the look-up, or using something from the tidyverse with tibbles and the purrr package, but nothing seemed easy to implement. Thanks!

vectorization iteration r tidyverse purrr

George Feb 08 '18 at 22:20

4 answers

Onyambu · Answer 1 · 2018-02-09T00:23:06+0000

You can use the sapply function:

    y = 0
    z = 10
    sapply(df$x, function(x) ifelse(x %% 2 == 0, {y <<- y + z; z <<- z - x; y}, y <<- y))
    [1]  0 10 10 18 18
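The sapply call above relies on `<<-` to modify y and z outside the anonymous function. As a hedged alternative that is not part of the original answer, the same recurrence can be sketched with base R's Reduce, passing the running state explicitly. The names `step`, `init` and `states` are made up for this illustration.

```r
# One step of the recurrence: `state` holds the running y and z,
# `xi` is the x value of the current row.
step <- function(state, xi) {
  if (xi %% 2 == 0) {
    list(y = state$y + state$z, z = state$z - xi)
  } else {
    state
  }
}

x <- 1:5
init <- list(y = 0, z = 10)   # state at row 1, which the loop never modifies

# accumulate = TRUE keeps every intermediate state, starting with `init`
states <- Reduce(step, x[-1], init, accumulate = TRUE)
y <- vapply(states, function(s) s$y, numeric(1))
y
# [1]  0 10 10 18 18
```

This keeps the global environment untouched, at the cost of building a list of states; it is still an element-by-element iteration, so it should not be expected to match a truly vectorized solution on large inputs.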

CPak · Answer 2 · 2018-02-09T04:49:17+0000

Here is a vectorized option

    vec_fun <- function(x, z) {
      L <- length(x)
      vec_z <- rep(0, L)
      I <- seq(2, L, by=2)
      vec_z[I] <- head(z - c(0, cumsum(I)), length(I))
      cumsum(vec_z)
    }

Alternative versions - sapply and tidyverse

    sapply_fun <- function(x, z) {
      y = 0
      sapply(x, function(x) ifelse(x %% 2 == 0, {y <<- y + z; z <<- z - x; y}, y <<- y))
    }

    library(tidyverse)

    tidy_fun <- function(df) {
      df %>%
        filter(x %% 2 != 0) %>%
        mutate(z = accumulate(c(z, x[-1] - 1), `-`)) %>%
        right_join(df, by = c("x", "y")) %>%
        mutate(z = lag(z), z = ifelse(is.na(z), 0, z)) %>%
        mutate(y = cumsum(z)) %>%
        select(-z) %>%
        pluck("y")
    }

Your data

    df <- data.frame(x=1:5, y=0)
    z <- 10

Check that they all return the same result.

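    # Note: identical() compares only its first two arguments; the third is
    # matched to num.eq, so tidy_fun(df) is not actually compared here.
    identical(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df))
    # TRUE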

Benchmark with a small data set - sapply_fun looks a little faster

    library(microbenchmark)
    microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")
    # Unit: relative
    #                 expr        min         lq       mean     median         uq      max neval
    #     vec_fun(df$x, z)   1.349053   1.316664   1.256691   1.359864   1.348181 1.146733   100
    #  sapply_fun(df$x, z)   1.000000   1.000000   1.000000   1.000000   1.000000 1.000000   100
    #         tidy_fun(df) 411.409355 378.459005 168.689084 301.029545 270.519170 4.244833   100

Now with a large data.frame

    df <- data.frame(x=1:1000, y=0)
    z <- 10000

The same result - yes

 identical(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df)) # TRUE

Benchmark with the large data set - now vec_fun is clearly faster

    library(microbenchmark)
    microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")
    # Unit: relative
    #                 expr       min        lq      mean    median        uq     max neval
    #     vec_fun(df$x, z)   1.00000   1.00000   1.00000   1.00000   1.00000   1.000   100
    #  sapply_fun(df$x, z)  42.69696  37.00708  32.19552  35.19225  27.82914  27.285   100
    #         tidy_fun(df) 259.87893 228.06417 201.43230 218.92552 172.45386 380.484   100
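vec_fun above builds its index with seq(2, L, by=2), which works because x is 1:L in the example data. As a hedged sketch that is not from the original answer, the same idea (the z seen at each even row is the starting z minus the sum of the earlier even x values) can be written for an arbitrary x. The name `vec_general` is hypothetical, and the sketch assumes y starts at 0 and x has at least one element.

```r
vec_general <- function(x, z) {
  even <- x %% 2 == 0
  even[1] <- FALSE                      # the loop starts at row 2, so row 1 never updates y or z
  dec <- ifelse(even, x, 0)             # amount subtracted from z after each row
  z_before <- z - (cumsum(dec) - dec)   # value of z at the moment row i is processed
  cumsum(ifelse(even, z_before, 0))     # y only grows on even rows
}

vec_general(1:5, 10)
# [1]  0 10 10 18 18
vec_general(c(3, 8, 6, 1, 4), 10)
# [1]  0 10 12 12  8
```

For x = 1:5 this reproduces the result of the for loop in the question; the second call shows an input where the even values are not at even positions.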

jaySf · Answer 3 · 2018-02-08T22:49:42+0000

Since your data contains only numbers, you can use a matrix rather than a data frame, which is slightly faster.

    mx <- matrix(c(x, y), ncol = 2, dimnames = list(1:length(x), c("x", "y")))

    for (i in seq(2, nrow(mx))) {
      if (mx[i, 1] %% 2 == 0) {
        mx[i, 2] <- mx[i-1, 2] + z
        z <- z - mx[i, 1]
      } else {
        mx[i, 2] <- mx[i-1, 2]
      }
    }

    mx
    #   x  y
    # 1 1  0
    # 2 2 10
    # 3 3 10
    # 4 4 18
    # 5 5 18

microbenchmark() results:

    # Unit: milliseconds
    # expr       min        lq     mean    median       uq       max neval
    #   mx  8.675346  9.542153 10.71271  9.925953 11.02796  89.35088  1000
    #   df 10.363204 11.249255 12.85973 11.785933 13.59802 106.99920  1000
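The question also mentions dropping the data frame and working on plain vectors. A minimal sketch of that variant, not benchmarked here and not part of the original answer, keeps the same loop but indexes preallocated vectors. The function name `loop_vec` is made up for this illustration.

```r
loop_vec <- function(x, z) {
  n <- length(x)
  y <- numeric(n)                 # preallocate; y[1] stays 0
  for (i in seq_len(n)[-1]) {     # rows 2..n; empty when n is 1
    if (x[i] %% 2 == 0) {
      y[i] <- y[i - 1] + z
      z <- z - x[i]
    } else {
      y[i] <- y[i - 1]
    }
  }
  y
}

loop_vec(1:5, 10)
# [1]  0 10 10 18 18
```

Because z is a local copy inside the function, the caller's z is left unchanged, unlike in the loops above that modify z in place.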

www · Answer 4 · 2018-02-09T00:29:36+0000

It would be great if we could vectorize the operation on the data frame. My strategy is to calculate the z value for each row and then use cumsum to compute y. The accumulate function from purrr computes the z values; right_join from dplyr and fill from tidyr are used for further processing.

    library(tidyverse)

    df2 <- df %>%
      filter(x %% 2 != 0) %>%
      mutate(z = accumulate(c(z, x[-1] - 1), `-`)) %>%
      right_join(df, by = c("x", "y")) %>%
      mutate(z = lag(z), z = ifelse(is.na(z), 0, z)) %>%
      mutate(y = cumsum(z)) %>%
      select(-z)

    df2
    #   x  y
    # 1 1  0
    # 2 2 10
    # 3 3 10
    # 4 4 18
    # 5 5 18
