Rolling sum over a window (running window sum) based on a condition in R

I am trying to calculate a rolling sum over a given window based on a condition. I have seen threads whose solutions compute a conditional running sum (Calculate the conditional running sum in R for each row in the data frame) and a rolling sum (Rolling Sum of another variable in R), but I could not find the two combined. I also read that data.table does not have a built-in rolling window function. So this problem is quite complicated for me.

Also, the solution posted by Mike Grahan for rolling sums is beyond my comprehension. I am looking for a data.table method, mainly for speed. However, I am open to other methods if they are understandable.

Here is my input:

    DFI <- structure(list(
      FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011, 2012, 2013, 2014, 2015,
             2010, 2016, 2013, 2014, 2015, 2010),
      Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575,
                   13575, 13575, 13578, 13578, 13578, 13578, 13578, 13578),
      Product = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
                  "A", "A", "B", "C", "D", "E"),
      Rev = c(4, 3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)),
      .Names = c("FY", "Customer", "Product", "Rev"),
      row.names = c(NA, 17L), class = "data.frame")

Here is my expected result (created manually; I apologize if it contains an error):

    DFO <- structure(list(
      FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2012, 2013, 2014, 2015,
             2010, 2016, 2013, 2014, 2015, 2010),
      Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575,
                   13575, 13578, 13578, 13578, 13578, 13578, 13578),
      Product = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
                  "A", "A", "B", "C", "D", "E"),
      Rev = c(4, 3, 3, 1, 2, 3, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2),
      cumsum = c(4, 7, 10, 11, 9, 3, 6, 10, 15, 21, 3, 2, 2, 4, 2, 2)),
      .Names = c("FY", "Customer", "Product", "Rev", "cumsum"),
      row.names = c(NA, 16L), class = "data.frame")

Some comments on the logic:

1) I want to find a rolling sum over a 5-year period. Ideally, the length of this period would be a variable that I can set elsewhere in the code, so that I can change the window later in my analysis.

2) The end of the window is the year of the current row (FY in the example above). For example, the maximum FY in DFI is 2016, so for all 2016 entries the window starts at 2016 - 5 + 1 = 2012.

3) The window sum (or rolling sum) is calculated per Customer and per Product.
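To make the three rules concrete, here is a minimal base R sketch for a single Customer/Product group. The helper name `window_sum` and its argument names are made up for this illustration and are not from any package; the data are the rows for Customer 13575, Product A from DFI.

```r
# Hypothetical helper: rolling sum over the last k fiscal years
# for one Customer/Product group.
window_sum <- function(fy, rev, k = 5) {
  # For each year y, add up the revenue of all years in [y - k + 1, y].
  vapply(fy, function(y) sum(rev[fy >= y - k + 1 & fy <= y]), numeric(1))
}

# Customer 13575, Product A
fy  <- c(2011, 2012, 2013, 2015, 2016)
rev <- c(4, 3, 3, 1, 2)

res <- window_sum(fy, rev, k = 5)
print(res)  # 4 7 10 11 9
```

The window is defined on the year values rather than on row positions, so a missing year (2014 here) does not shift the window, and k can be changed in one place.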

What I tried:

I wanted to try something before posting. Here is my code:

    DFI <- data.table::as.data.table(DFI)
    # Sort it first
    DFI <- DFI[order(Customer, FY), ]
    # Find cumulative sum; remove Rev column; order rows
    DFOTest <- DFI[, cumsum := cumsum(Rev), by = .(Customer, Product)][
      , .SD[which.max(cumsum)], by = .(FY, Customer, Product)][
      , ("Rev") := NULL][order(Customer, Product, FY)]

This code calculates the cumulative sum, but I cannot work out how to define the 5-year window and then calculate the rolling sum. I have two questions:

Question 1) How do I calculate the 5-year rolling sum?

Question 2) Can someone explain Mike's method in this thread? It seems to be fast, but I am not quite sure what is going on there. I saw that someone asked for comments on it, so I am not sure it is self-explanatory.

Thanks in advance. I have been struggling with this problem for two days.

r data.table dplyr
4 answers

1) rollapply

Create a Sum function that takes FY and Rev as a 2-column matrix (or, if the input is not one, turns it into one) and then sums the revenues for those years that are within k of the last year. Then convert DFI to a data table, sum the rows having the same Customer/Product/FY, and run rollapplyr with Sum for each Customer/Product group.

    library(data.table)
    library(zoo)

    k <- 5
    Sum <- function(x) {
      x <- matrix(x,, 2)                 # coerce to a 2-column matrix (FY, Rev)
      FY <- x[, 1]
      Rev <- x[, 2]
      ok <- FY >= tail(FY, 1) - k + 1    # years within k of the last year
      sum(Rev[ok])
    }

    DT <- as.data.table(DFI)
    # Collapse duplicate Customer/Product/FY rows
    DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
    DT[, cumsum := rollapplyr(.SD, k, Sum, by.column = FALSE, partial = TRUE),
       by = c("Customer", "Product"), .SDcols = c("FY", "Rev")]

giving:

    > DT
        Customer Product   FY Rev cumsum
     1:    13575       A 2011   4      4
     2:    13575       A 2012   3      7
     3:    13575       A 2013   3     10
     4:    13575       A 2015   1     11
     5:    13575       A 2016   2      9
     6:    13575       B 2011   3      3
     7:    13575       B 2012   3      6
     8:    13575       B 2013   4     10
     9:    13575       B 2014   5     15
    10:    13575       B 2015   6     21
    11:    13578       A 2010   3      3
    12:    13578       A 2016   2      2
    13:    13578       B 2013   2      2
    14:    13578       C 2014   4      4
    15:    13578       D 2015   2      2
    16:    13578       E 2010   2      2

2) data.table only

First sum the rows that have the same Customer/Product/FY. Then, grouping by Customer/Product, for each FY value fy, pick out the Rev values whose FY values are between fy-k+1 and fy, and sum them.

    library(data.table)

    k <- 5
    DT <- as.data.table(DFI)
    # Collapse duplicate Customer/Product/FY rows
    DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
    DT[, cumsum := sapply(FY, function(fy) sum(Rev[between(FY, fy-k+1, fy)])),
       by = c("Customer", "Product")]

giving:

    > DT
        Customer Product   FY Rev cumsum
     1:    13575       A 2011   4      4
     2:    13575       A 2012   3      7
     3:    13575       A 2013   3     10
     4:    13575       A 2015   1     11
     5:    13575       A 2016   2      9
     6:    13575       B 2011   3      3
     7:    13575       B 2012   3      6
     8:    13575       B 2013   4     10
     9:    13575       B 2014   5     15
    10:    13575       B 2015   6     21
    11:    13578       A 2010   3      3
    12:    13578       A 2016   2      2
    13:    13578       B 2013   2      2
    14:    13578       C 2014   4      4
    15:    13578       D 2015   2      2
    16:    13578       E 2010   2      2

My solution stays on the tidyverse side; however, unless your source data is very large, the performance difference may not be a problem.

I start by declaring a function that calculates the rolling sum using tibbletime::rollify, and by expanding the data frame to include the missing FY values. Then I group, summarise, and apply the rolling sum.

    library(tidyr)
    library(dplyr)

    rollsum_5 <- tibbletime::rollify(sum, window = 5)

    DFI %>%
      complete(FY, Customer, Product) %>%
      replace_na(list(Rev = 0)) %>%
      arrange(Customer, Product, FY) %>%
      group_by(Customer, Product, FY) %>%
      summarise(Rev = sum(Rev)) %>%
      mutate(cumsum = rollsum_5(Rev)) %>%
      ungroup %>%
      filter(Rev != 0)

    # # A tibble: 16 x 5
    #    Customer Product    FY   Rev cumsum
    #       <dbl> <chr>   <dbl> <dbl>  <dbl>
    #  1    13575 A        2011  4.00  NA
    #  2    13575 A        2012  3.00  NA
    #  3    13575 A        2013  3.00  NA
    #  4    13575 A        2015  1.00  11.0
    #  5    13575 A        2016  2.00   9.00
    #  6    13575 B        2011  3.00  NA
    #  7    13575 B        2012  3.00  NA
    #  8    13575 B        2013  4.00  NA
    #  9    13575 B        2014  5.00  15.0
    # 10    13575 B        2015  6.00  21.0
    # 11    13578 A        2010  3.00  NA
    # 12    13578 A        2016  2.00   2.00
    # 13    13578 B        2013  2.00  NA
    # 14    13578 C        2014  4.00   4.00
    # 15    13578 D        2015  2.00   2.00
    # 16    13578 E        2010  2.00  NA

NB: In this case the rolling sum is shown only in rows where the window (5 rows) is complete. It would be questionable to present partial values as a five-year sum.
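To illustrate the difference between complete and partial windows without any packages, here is a rough base R sketch. The variable names (`padded`, `full_only`, `partial`) are invented for this example; the data are the two rows for Customer 13578, Product A. Missing years are padded with zero so that a window of k rows corresponds to k years.

```r
# Customer 13578, Product A
fy  <- c(2010, 2016)
rev <- c(3, 2)
k   <- 5

# Pad the missing years with zero revenue
all_years <- seq(min(fy), max(fy))
padded <- numeric(length(all_years))
padded[match(fy, all_years)] <- rev

# Trailing window of k rows; NA until k rows are available
full_only <- sapply(seq_along(padded), function(i) {
  if (i < k) NA_real_ else sum(padded[(i - k + 1):i])
})

# Same window, but shorter windows are allowed at the start
partial <- sapply(seq_along(padded), function(i) {
  sum(padded[max(1, i - k + 1):i])
})

# Keep only the years that were present in the input
keep <- all_years %in% fy
out <- data.frame(FY = all_years[keep],
                  full_only = full_only[keep],
                  partial = partial[keep])
print(out)
```

For 2010 the complete-window version gives NA while the partial version gives 3, which is the value in the question's expected output; which of the two is appropriate depends on whether a sum over fewer than k years should count as a k-year sum.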


A solution using dplyr, tidyr and zoo.

    # Load packages
    library(dplyr)
    library(tidyr)
    library(zoo)

    # A helper function to convert the rolling cumsum result
    cumsum_roll <- function(x){
      vec <- c(x[1, ], x[, ncol(x)][-1])
      return(vec)
    }

    DFI2 <- DFI %>%
      # Group by FY, Customer, Product
      group_by_at(vars(-Rev)) %>%
      # Calculate the total Rev of each group
      summarise(Rev = sum(Rev)) %>%
      ungroup() %>%
      group_by(Customer) %>%
      # Expand the data frame based on FY and Product
      # Fill the Rev to be 0
      complete(FY = full_seq(FY, period = 1), Product, fill = list(Rev = 0)) %>%
      # Sort the data frame by Customer, Product, and FY
      arrange(Customer, Product, FY) %>%
      ungroup() %>%
      group_by(Customer, Product) %>%
      # Apply the rolling cumsum by rollapply. Specify the window as 5.
      # cumsum_roll converts the output of rollapply, a matrix, to a vector
      mutate(cumsum = cumsum_roll(rollapply(Rev, 5, FUN = cumsum))) %>%
      # Remove Rev = 0
      filter(Rev != 0) %>%
      # Reorder the columns
      select(FY, Customer, Product, Rev, cumsum) %>%
      ungroup() %>%
      as.data.frame()

    DFI2
    #      FY Customer Product Rev cumsum
    # 1  2011    13575       A   4      4
    # 2  2012    13575       A   3      7
    # 3  2013    13575       A   3     10
    # 4  2015    13575       A   1     11
    # 5  2016    13575       A   2      9
    # 6  2011    13575       B   3      3
    # 7  2012    13575       B   3      6
    # 8  2013    13575       B   4     10
    # 9  2014    13575       B   5     15
    # 10 2015    13575       B   6     21
    # 11 2010    13578       A   3      3
    # 12 2016    13578       A   2      2
    # 13 2013    13578       B   2      2
    # 14 2014    13578       C   4      4
    # 15 2015    13578       D   2      2
    # 16 2010    13578       E   2      2

Not a novel tidyverse answer, but I think nest helps with readability.

    library(tidyverse)
    library(zoo)

    roll_cumsum <- function(df) {
      df %>%
        complete(FY = full_seq(FY, period=1)) %>%
        mutate(roll_cumsum = rollapplyr(Rev, 5, sum, na.rm=TRUE, partial=TRUE))
    }

    DFI %>%
      group_by_at(vars(-Rev)) %>%
      summarise(Rev = sum(Rev)) %>%
      group_by(Customer, Product) %>%
      nest(FY, Rev) %>%
      mutate(data = map(data, ~roll_cumsum(.x))) %>%
      unnest() %>%
      filter(!is.na(Rev)) %>%
      arrange(Customer, Product, FY)

    # A tibble: 16 x 5
    #    Customer Product    FY   Rev roll_cumsum
    #       <dbl> <chr>   <dbl> <dbl>       <dbl>
    #  1    13575 A        2011  4.00        4.00
    #  2    13575 A        2012  3.00        7.00
    #  3    13575 A        2013  3.00       10.0
    #  4    13575 A        2015  1.00       11.0
    #  5    13575 A        2016  2.00        9.00
    #  6    13575 B        2011  3.00        3.00
    #  7    13575 B        2012  3.00        6.00
    #  8    13575 B        2013  4.00       10.0
    #  9    13575 B        2014  5.00       15.0
    # 10    13575 B        2015  6.00       21.0
    # 11    13578 A        2010  3.00        3.00
    # 12    13578 A        2016  2.00        2.00
    # 13    13578 B        2013  2.00        2.00
    # 14    13578 C        2014  4.00        4.00
    # 15    13578 D        2015  2.00        2.00
    # 16    13578 E        2010  2.00        2.00