The moving sum of another variable in R

I want to get a rolling 7-day sum by ID. Suppose my data looks like this:

    data <- as.data.frame(matrix(NA, 42, 3))
    data$V1 <- seq(as.Date("2014-05-01"), as.Date("2014-09-01"), by = 3)
    data$V2 <- rep(1:6, 7)
    data$V3 <- rep(c(1, 2), 21)
    colnames(data) <- c("Date", "USD", "ID")

             Date USD ID
    1  2014-05-01   1  1
    2  2014-05-04   2  2
    3  2014-05-07   3  1
    4  2014-05-10   4  2
    5  2014-05-13   5  1
    6  2014-05-16   6  2
    7  2014-05-19   1  1
    8  2014-05-22   2  2
    9  2014-05-25   3  1
    10 2014-05-28   4  2

How can I add a new column that contains the 7-day moving sum by ID?

+8
Tags: r, xts, data.table
5 answers

If your data is big, check out this solution using data.table. It is pretty fast. If you need even more speed, you can always change mapply to mcmapply and use multiple cores; a sketch of that variant follows the code below.

    # Load data.table and convert to a data.table object
    require(data.table)
    setDT(data)[, ID2 := .GRP, by = c("ID")]

    # Build reference table
    Ref <- data[, list(Compare_Value = list(I(USD)), Compare_Date = list(I(Date))),
                by = c("ID2")]

    # Use mapply to get the last seven days of value by ID
    data[, Roll.Val := mapply(RD = Date, NUM = ID2, function(RD, NUM) {
      d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
      sum((d <= 0 & d >= -7) * Ref$Compare_Value[[NUM]])
    })]
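A rough sketch of the mcmapply variant (untested here; it reuses data and Ref from the block above, mc.cores = 2 is an arbitrary core count, and since mcmapply parallelises by forking it assumes a Unix-alike, not Windows):

    # Same computation as above, parallelised across rows with parallel::mcmapply.
    # Assumes `data` and `Ref` exist as built in the previous block.
    library(parallel)
    data[, Roll.Val := mcmapply(RD = Date, NUM = ID2, function(RD, NUM) {
      d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
      sum((d <= 0 & d >= -7) * Ref$Compare_Value[[NUM]])
    }, mc.cores = 2)]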
+8

The dataset provided by the OP does not reveal the complexity of the task. As far as solving the OP's question goes, only Mike's answer has been correct so far.
In fact it rolls over 8 calendar days instead of 7, because of d <= 0 & d >= -7 (see the standalone check after this paragraph).
The zoo answer by @G. Grothendieck is almost valid; it would only be correct if the merge were done for each group of ID.
Below is a second data.table solution, this time with correct results, using the development version of RcppRoll, which allows na.rm=TRUE.
And a slightly reformatted version of Mike's solution.
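A quick standalone check (my own illustration, not taken from any answer) that this condition selects 8 calendar days:

    # Count the days picked up by Mike's window condition around a reference date.
    RD <- as.Date("2014-05-08")
    days <- seq(RD - 10, RD + 4, by = "day")
    d <- as.numeric(days - RD)
    sum(d <= 0 & d >= -7)
    # [1] 8   (the window is RD-7 .. RD inclusive, i.e. 8 days, not 7)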

    data <- as.data.frame(matrix(NA, 42, 3))
    data$V1 <- seq(as.Date("2014-05-01"), as.Date("2014-09-01"), by = 3)
    data$V2 <- rep(1:6, 7)
    data$V3 <- rep(c(1, 2), 21)
    colnames(data) <- c("Date", "USD", "ID")

    library(microbenchmark)
    library(RcppRoll)   # install_github("kevinushey/RcppRoll")
    library(data.table) # install_github("Rdatatable/data.table")

    correct_jan_dt = function(n, partial = TRUE){
      DT = as.data.table(data) # this could be sped up with setDT()
      date.range = DT[, range(Date)]
      all.dates = seq.Date(date.range[1], date.range[2], by = 1)
      setkey(DT, ID, Date)
      r = DT[CJ(unique(ID), all.dates)
             ][, c("roll") := as.integer(roll_sumr(USD, n, normalize = FALSE, na.rm = TRUE)), by = "ID"
               ][!is.na(USD)]
      # This could be simplified once a `partial` argument is implemented
      # in kevinushey/RcppRoll: https://github.com/kevinushey/RcppRoll
      if(isTRUE(partial)){
        r[is.na(roll), roll := cumsum(USD), by = "ID"][]
      }
      return(r[order(Date, ID)])
    }

    correct_mike_dt = function(){
      data = as.data.table(data)[, ID2 := .GRP, by = c("ID")]
      # Build reference table
      Ref <- data[, list(Compare_Value = list(I(USD)), Compare_Date = list(I(Date))),
                  by = c("ID2")]
      # Use mapply to get the last seven days of value by ID
      data[, c("roll") := mapply(RD = Date, NUM = ID2, function(RD, NUM){
        d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
        sum((d <= 0 & d >= -7) * Ref$Compare_Value[[NUM]])
      })][, ID2 := NULL][]
    }

    identical(correct_mike_dt(), correct_jan_dt(n = 8, partial = TRUE))
    # [1] TRUE

    microbenchmark(unit = "relative", times = 5L, correct_mike_dt(), correct_jan_dt(8))
    # Unit: relative
    #               expr      min       lq     mean   median       uq      max neval
    #  correct_mike_dt() 274.0699 273.9892 267.2886 266.6009 266.2254 256.7296     5
    #  correct_jan_dt(8)   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000     5

Looking forward to an update from @Khashaa.

Edit (20150122.2): the benchmarks below do not answer the OP's question.

Timings on a larger (still very tiny) dataset, 5439 rows:

    library(zoo)
    library(data.table)
    library(dplyr)
    library(RcppRoll)
    library(microbenchmark)

    data <- as.data.frame(matrix(NA, 5439, 3))
    data$V1 <- seq(as.Date("1970-01-01"), as.Date("2014-09-01"), by = 3)
    data$V2 <- sample(1:6, 5439, TRUE)
    data$V3 <- sample(c(1, 2), 5439, TRUE)
    colnames(data) <- c("Date", "USD", "ID")

    zoo_f = function(){
      z <- read.zoo(data)
      z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
      roll <- function(x) rollsumr(x, 7, fill = NA)
      transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
    }

    dt_f = function(){
      DT = as.data.table(data) # this could be sped up with setDT()
      date.range = DT[, range(Date)]
      all.dates = seq.Date(date.range[1], date.range[2], by = 1)
      setkey(DT, Date)
      DT[.(all.dates)
         ][order(Date), c("roll") := rowSums(setDT(shift(USD, 0:6, NA, "lag")), na.rm = FALSE), by = "ID"
           ][!is.na(ID)]
    }

    dp_f = function(){
      data %>% group_by(ID) %>% mutate(roll = roll_sum(c(rep(NA, 6), USD), 7))
    }

    dt2_f = function(){
      # this could be sped up with setDT()
      as.data.table(data)[, c("roll") := roll_sum(c(rep(NA, 6), USD), 7), by = "ID"][]
    }

    identical(as.data.table(zoo_f()), dt_f())
    # [1] TRUE
    identical(setDT(as.data.frame(dp_f())), dt_f())
    # [1] TRUE
    identical(dt2_f(), dt_f())
    # [1] TRUE

    microbenchmark(unit = "relative", times = 20L, zoo_f(), dt_f(), dp_f(), dt2_f())
    # Unit: relative
    #     expr        min         lq       mean     median         uq        max neval
    #  zoo_f() 140.331889 141.891917 138.064126 139.381336 136.029019 137.730171    20
    #   dt_f()  14.917166  14.464199  15.210757  16.898931  16.543811  14.221987    20
    #   dp_f()   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000    20
    #  dt2_f()   1.536896   1.521983   1.500392   1.518641   1.629916   1.337903    20
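As an aside, the shift() plus rowSums() idiom inside dt_f() may be easier to see on a standalone toy vector (my own example, not the benchmark data):

    # shift(x, 0:2, NA, "lag") returns a list of x lagged by 0, 1 and 2 positions;
    # summing across rows then yields a 3-element rolling sum.
    library(data.table)
    x <- c(1, 2, 3, 4)
    rowSums(setDT(shift(x, 0:2, NA, "lag")), na.rm = FALSE)
    # [1] NA NA  6  9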

But I'm not sure my data.table code is optimal yet.

The benchmarked functions above do not answer the OP's question; see the top of this post for the update. Mike's solution was the correct one.
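For what it's worth, in data.table versions newer than this thread (non-equi joins arrived in 1.9.8), the calendar-day window can also be written as a self-join. This is a sketch under my own assumptions: the window Date-6 .. Date gives exactly 7 calendar days (use Date-7 to reproduce Mike's 8-day window), and roll7 and start are names I made up:

    # 7-calendar-day rolling sum per ID via a non-equi self-join.
    library(data.table)
    DT <- as.data.table(data)
    DT[, start := Date - 6]  # window lower bound: 6 days back, current day included
    res <- DT[DT, on = .(ID, Date >= start, Date <= Date),
              .(roll7 = sum(USD)), by = .EACHI]
    DT[, roll7 := res$roll7]  # .EACHI keeps one result row per original row, in order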

+6

1) Assuming you mean a sum over every 7 consecutive overlapping rows within each ID:

    library(zoo)
    transform(data, roll = ave(USD, ID, FUN = function(x) rollsumr(x, 7, fill = NA)))
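As a side note on the pattern: ave() applies FUN within each group and returns the results aligned to the original row order, which is why they can be attached directly with transform(). A tiny standalone demo on toy vectors of my own:

    # Group-wise cumsum; results come back in the input's row order.
    ave(c(1, 10, 2, 20), c("a", "b", "a", "b"), FUN = cumsum)
    # [1]  1 10  3 30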

2) If you really meant 7 days rather than 7 rows, try the following:

    library(zoo)
    z <- read.zoo(data)
    z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
    roll <- function(x) rollsumr(x, 7, fill = NA)
    transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
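The merge with an empty daily zoo series is what turns the row window into a day window; a standalone toy illustration (my own three-point series, not the OP's data):

    library(zoo)
    # Three observations on irregular dates...
    z <- zoo(c(1, 2, 4), as.Date("2014-05-01") + c(0, 2, 5))
    # ...expanded to a regular daily series with missing days filled by 0,
    # so that rollsumr(x, 7) now really spans 7 calendar days.
    merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0)
    # 2014-05-01 2014-05-02 2014-05-03 2014-05-04 2014-05-05 2014-05-06
    #          1          0          2          0          0          4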

Update: added (2) and made some improvements.

+4
    library(data.table)
    data <- data.table(Date = seq(as.Date("2014-05-01"), as.Date("2014-09-01"), by = 3),
                       USD  = rep(1:6, 7),
                       ID   = rep(c(1, 2), 21))
    data[, Rolling7DaySum := {
      d <- data$Date - Date
      sum(data$USD[ID == data$ID & d <= 0 & d >= -7])
    }, by = list(Date, ID)]
+2

I found some problems with the code suggested by Mike.Gahan and fixed them after testing, as shown below (the lookup into Ref is done by ID2 value rather than by row position):

    require(data.table)
    setDT(data)[, ID2 := .GRP, by = c("ID")]
    Ref <- data[, list(Compare_Value = list(I(USD)), Compare_Date = list(I(Date))),
                by = c("ID2")]
    data[, Roll.Val := mapply(RD = Date, NUM = ID2, function(RD, NUM) {
      d <- as.numeric(Ref[ID2 == NUM, ]$Compare_Date[[1]] - RD)
      sum((d <= 0 & d >= -7) * Ref[ID2 == NUM, ]$Compare_Value[[1]])
    })]
+1
