Sum over a window whose size is defined in a column

I want to compute sum(x) over the current row and the next N rows for each data.table row within a group, where N is the value in the window column.

Code for generating sample data:

    set.seed(100)
    ids <- 1:100
    x <- floor(runif(100, 1, 100))
    groups <- floor(runif(100, 1, 10)) * 10
    window <- floor(runif(100, 1, 5))
    library('data.table')
    data <- data.table(ids, x, groups, window)
    setkey(data, groups, ids)

First rows:

        ids  x groups window
     1:   3 55     10      4
     2:   9 55     10      1
     3:  13 28     10      1
     4:  16 67     10      3
     5:  26 17     10      3
     6:  30 28     10      2
     7:  36 89     10      2
     8:  38 63     10      3
     9:  42 86     10      3
    10:  48 88     10      1
    11:  49 21     10      1
    12:  59 60     10      3
    13:  65 45     10      4
    14:  67 46     10      2
    15:  88 25     10      4
    16:  19 36     20      2

So, for the first row, the result is the sum of the current row and the following 4 rows: res = 55 + 55 + 28 + 67 + 17 = 222

For row 15, where the group ends, I just need the value of the current row: res = 25 + 0 (no rows left) = 25.
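For reference, the rule can be spelled out row by row. This is only a minimal base-R sketch of the expected output, not the vectorised solution being asked for; `window_sum_naive` is a hypothetical helper name, and the two vectors are the group 10 values copied from the sample rows above.

```r
# x and window for group 10, copied from the sample rows above
x      <- c(55, 55, 28, 67, 17, 28, 89, 63, 86, 88, 21, 60, 45, 46, 25)
window <- c(4, 1, 1, 3, 3, 2, 2, 3, 3, 1, 1, 3, 4, 2, 4)

# Hypothetical helper: explicit per-row reference implementation
window_sum_naive <- function(x, window) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    end <- min(i + window[i], n)  # clamp the window at the end of the group
    sum(x[i:end])                 # current row plus the rows inside the window
  }, numeric(1))
}

res_naive <- window_sum_naive(x, window)
res_naive[c(1, 15)]
# [1] 222  25
```

Anything faster should reproduce these values for every row of the group.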

This is the pseudocode for this logic:

    # pseudocode only: .CurrentRow is not a real data.table symbol
    res <- data[, .(ids, groups, x, window,
                    result = sum(.SD[.CurrentRow:(.CurrentRow + window)], na.rm = TRUE)),
                by = groups, .SDcols = c("x")]

I hope this can be implemented with data.table. I want to avoid writing a for loop for this.

2 answers

I'm not sure whether this can be done without iterating over all the rows, so here is one such solution:

    # end = last row index of each window, clamped to the last row of the group
    data[, end := pmin(.I + window, .I[.N]), by = groups][
         , res := sum(data$x[.I:end]), by = 1:nrow(data)][1:16]
    #     ids  x groups window end res
    #  1:   3 55     10      4   5 222
    #  2:   9 55     10      1   3  83
    #  3:  13 28     10      1   4  95
    #  4:  16 67     10      3   7 201
    #  5:  26 17     10      3   8 197
    #  6:  30 28     10      2   8 180
    #  7:  36 89     10      2   9 238
    #  8:  38 63     10      3  11 258
    #  9:  42 86     10      3  12 255
    # 10:  48 88     10      1  11 109
    # 11:  49 21     10      1  12  81
    # 12:  59 60     10      3  15 176
    # 13:  65 45     10      4  15 116
    # 14:  67 46     10      2  15  71
    # 15:  88 25     10      4  15  25
    # 16:  19 36     20      2  18 173

As alexis_laz points out, you can do better by computing cumsum once and then subtracting the extra part, thereby avoiding explicit iteration over the rows:

    data[, res := {
      cs <- cumsum(x)
      cs[pmin(1:.N + window, .N)] - shift(cs, fill = 0)
    }, by = groups]

I will try to explain what is happening here:

  • res := {...} adds a column to the data.table holding the value of the R expression inside the braces;
  • cs <- cumsum(x) computes the running total of x over the rows of the group;
  • cs[pmin(1:.N + window, .N)] takes the running total at the end of the window, or at the last row of the group if the window runs past it;
  • shift(cs, fill = 0) gives the running total at the previous row (0 for the first row of the group);
  • the difference of the two is the sum of the elements inside the window.
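The steps above can be traced on a small example. This is a minimal base-R sketch on a hypothetical five-row group (the first five x and window values from the sample, treated as a complete group); `c(0, head(cs, -1))` is used here as a base-R stand-in for `shift(cs, fill = 0)` so the snippet runs without data.table.

```r
# Hypothetical five-row group (first five sample rows, treated as a whole group)
x      <- c(55, 55, 28, 67, 17)
window <- c(4, 1, 1, 3, 3)
n      <- length(x)

cs   <- cumsum(x)                      # running total:            55 110 138 205 222
end  <- pmin(seq_len(n) + window, n)   # window end, clamped to n:  5   3   4   5   5
prev <- c(0, head(cs, -1))             # running total before row:  0  55 110 138 205

res <- cs[end] - prev                  # sum of x[i:end[i]] for every row i
res
# [1] 222  83  95  84  17
```

Rows 4 and 5 differ from the full sample output only because this toy group stops at row 5, so their windows are cut short.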

Since there are several working answers to this question, I think it's worth adding a benchmark here:

    library(microbenchmark)
    # tapplyWay() is not defined in this post; presumably it wraps the tapply-based answer
    m <- microbenchmark(
      "tapply(rawr)" = tapplyWay(dd),
      "data.table(eddi)" = data[, end := pmin(.I + window, .I[.N]), by = groups][
        , res := sum(data$x[.I:end]), by = 1:nrow(data)],
      "data.table(alexis_laz)" = data[, res := {
        cs <- cumsum(x)
        cs[pmin(1:.N + window, .N)] - shift(cs, fill = 0)
      }, by = groups],
      times = 10)
    print(m)
    boxplot(m)

Results for a sample of 10^5 rows:

    Unit: milliseconds
                 expr     min       lq    mean  median      uq      max neval
         tapply(rawr) 2575.12 2761.365 2898.63 2905.77 3041.08  3127.86    10
     data.table(eddi) 1418.92 1570.230 1758.70 1656.14 1977.59  2358.85    10
       dt(alexis_laz)    6.82     7.73    8.78    8.09   10.37 12.37119    10

[Figure: boxplot of the benchmarked solutions]


First, load the package, build the sample data, and convert the data.table to a data.frame:

    set.seed(100)
    ids <- 1:100
    x <- floor(runif(100, 1, 100))
    groups <- floor(runif(100, 1, 10)) * 10
    window <- floor(runif(100, 1, 5))
    library('data.table')
    data <- data.table(ids, x, groups, window)
    setkey(data, groups, ids)
    dd <- as.data.frame(data)

Then, within each group, expand every row into the rows covered by its window, bind the pieces into one larger data frame, and summarise it with your favourite aggregation method:

    tmp <- tapply(seq(nrow(dd)), dd$groups, function(ii) {
      # row indices covered by each window (may run past the group)
      idx <- Map(`:`, ii, ii + dd$window[ii])
      out <- dd[unlist(idx), ]
      # tag every expanded row with the id of the row whose window it belongs to
      out$idx <- rep(dd$ids[ii], lengths(idx))
      # drop rows that fall outside the current group
      out[out$groups %in% dd$groups[ii], ]
    })
    tmp <- do.call('rbind', tmp)
    res <- aggregate(x ~ idx + groups, tmp, sum)
    #    idx groups   x
    # 1    3     10 222
    # 2    9     10  83
    # 3   13     10  95
    # 4   16     10 201
    # 5   26     10 197
    # 6   30     10 180
    # 7   36     10 238
    # 8   38     10 258
    # 9   42     10 255
    # 10  48     10 109
    # 11  49     10  81
    # 12  59     10 176
    # 13  65     10 116
    # 14  67     10  71
    # 15  88     10  25
    # 16  19     20 173
    identical(table(dd$groups), table(res$groups))
    # [1] TRUE
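To see what the row expansion does, here is a minimal sketch on a hypothetical three-row group. The values and the names `long` and `start` are made up for illustration, and the sketch clamps the indices explicitly instead of filtering by group as the code above does.

```r
# Hypothetical single group
x      <- c(10, 20, 30)
window <- c(2, 1, 1)
n      <- length(x)

# one index vector per row: the row itself plus the rows in its window
idx <- Map(`:`, seq_len(n), seq_len(n) + window)
# drop indices that run past the end of the group
idx <- lapply(idx, function(i) i[i <= n])

# long format: one line per (starting row, covered row) pair
long <- data.frame(
  start = rep(seq_len(n), lengths(idx)),
  x     = x[unlist(idx)]
)

res <- aggregate(x ~ start, long, sum)
res$x
# [1] 60 50 30
```

The long data frame has one line per element of every window, which is probably why this approach is the slowest in the benchmark.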