Find the sum of the previous n rows in a dataframe

I want to find the sum of the previous n rows in a data frame. For instance:

    id <- 1:10
    vals <- c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8)
    test <- data.frame(id, vals)

So, for n = 3, I would like to compute the following column (e.g. the value in row 3 is 4 + 7 + 2 = 13):

    test$sum <- c(NA, NA, 13, 18, 18, 16, 11, 10, 11, 15)

The closest I came was creating a new column with:

    test$valprevious <- c(NA, head(test$vals, -1))

Then I would use a loop to repeat this n times and sum over the resulting columns. I'm sure this is not the most efficient method; are there any functions that access the previous n rows, or another way to do this?
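A minimal sketch of that shift-and-sum idea might look like this (illustrative only; variable names are assumptions, not code from the question):

    # Build n shifted copies of vals as columns, then sum across rows.
    n <- 3
    shifted <- sapply(0:(n - 1), function(k)
      c(rep(NA, k), head(test$vals, length(test$vals) - k)))
    test$sum <- rowSums(shifted)   # first n-1 rows are NA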

1 answer

You can use the rollsumr function from the zoo package to do this:

    library(zoo)
    test$sums <- rollsumr(test$vals, k = 3, fill = NA)

which gives:

    > test
       id vals sums
    1   1    4   NA
    2   2    7   NA
    3   3    2   13
    4   4    9   18
    5   5    7   18
    6   6    0   16
    7   7    4   11
    8   8    6   10
    9   9    1   11
    10 10    8   15

This is the same as the rollsum function with the align = 'right' parameter:

 rollsum(test$vals, k = 3, fill = NA, align = 'right') 

Alternatively, you can use Reduce with shift from the data.table package:

    library(data.table)
    setDT(test)[, sums := Reduce(`+`, shift(vals, 0:2))]

which gives the same result:

    > test
        id vals sums
     1:  1    4   NA
     2:  2    7   NA
     3:  3    2   13
     4:  4    9   18
     5:  5    7   18
     6:  6    0   16
     7:  7    4   11
     8:  8    6   10
     9:  9    1   11
    10: 10    8   15
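To see why this works: shift(vals, 0:2) returns a list containing the vector itself and its first two lags, which Reduce then adds element-wise. A small illustration (not from the original answer):

    # shift(vals, 0:2) returns a list of the vector and its first two lags:
    str(shift(c(4, 7, 2, 9), 0:2))
    # List of 3
    #  $ : num [1:4] 4 7 2 9
    #  $ : num [1:4] NA 4 7 2
    #  $ : num [1:4] NA NA 4 7
    # Reduce(`+`, ...) adds these element-wise, so element i is
    # vals[i] + vals[i-1] + vals[i-2], i.e. the rolling sum of width 3.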

A nice base R alternative, suggested by @alexis_laz in the comments:

    n <- 3
    cs <- cumsum(test$vals)
    test$sums <- c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))
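The idea here is that the sum of the window ending at row i equals cs[i] - cs[i - n], the difference of two cumulative sums. A quick check on the example data (illustrative, not from the original answer):

    n <- 3
    cs <- cumsum(test$vals)   # 4 11 13 22 29 29 33 39 40 48
    cs[4] - cs[4 - n]         # 22 - 4 = 18, the sum of rows 2:4 (7 + 2 + 9)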

Another option, suggested by @Khashaa in the comments:

    # with base R
    n <- 3
    test$sums <- c(rep_len(NA, n - 1), rowSums(embed(test$vals, n)))

    # with RcppRoll
    library(RcppRoll)
    test$sums <- roll_sumr(test$vals, 3)
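The embed call is what makes the base R one-liner work: embed(x, n) returns a matrix whose rows are the length-n windows of x, most recent value first, so rowSums gives the rolling sums. A small illustration (not from the original answer):

    embed(c(4, 7, 2, 9, 7), 3)
    #      [,1] [,2] [,3]
    # [1,]    2    7    4
    # [2,]    9    2    7
    # [3,]    7    9    2
    # rowSums of this matrix are 13, 18, 18 -- the first three rolling sums.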

Benchmarks:

As @alexis_laz noted in the comments, some of the solutions incur overhead from repeatedly computing and recreating vectors, which can lead to differences in computational speed.

    # create a function for each of the different solutions:
    alexis_laz <- function(test) {
      n <- 3
      cs <- cumsum(test$vals)
      test$sums <- c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))
    }
    khashaa <- function(test) {
      n <- 3
      test$sums <- c(rep_len(NA, n - 1), rowSums(embed(test$vals, n)))
    }
    rcpp_roll <- function(test) test$sums <- roll_sumr(test$vals, 3)
    zoo_roll  <- function(test) test$sums <- rollsumr(test$vals, k = 3, fill = NA)
    dt_reduce <- function(test) setDT(test)[, sums := Reduce(`+`, shift(vals, 0:2))]

Running the benchmark on the small example dataset:

    library(microbenchmark)
    microbenchmark(alexis_laz(test),
                   khashaa(test),
                   rcpp_roll(test),
                   zoo_roll(test),
                   dt_reduce(test),
                   times = 10)

which gives:

    Unit: microseconds
                 expr     min      lq     mean   median      uq     max neval cld
     alexis_laz(test)  61.390  99.507 107.7025 108.7515 122.849 131.376    10   a
        khashaa(test)  35.758  92.596  94.1640 100.4875 103.264 112.779    10   a
      rcpp_roll(test)  26.727  99.709  96.1154 106.1295 114.483 116.553    10   a
       zoo_roll(test) 304.586 389.991 390.7553 398.8380 406.352 419.544    10   c
      dt_reduce(test) 254.837 258.979 277.4706 264.0625 269.711 389.606    10   b

As you can see, the RcppRoll solution and the two base R solutions of @alexis_laz and @Khashaa are considerably faster than the zoo and data.table solutions (though still on the order of microseconds, so nothing to worry about at this size).

With a much larger dataset:

    test <- data.frame(id = rep(1:10, 1e7),
                       vals = sample(c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8), 1e7, TRUE))

the picture changes:

    Unit: milliseconds
                 expr        min         lq      mean    median        uq       max neval cld
     alexis_laz(test)  3181.4270  3447.1210  4392.166  4801.410  4889.001  5002.363    10  b
        khashaa(test)  6313.4829  7305.3334  7478.831  7680.176  7723.830  7859.335    10  c
      rcpp_roll(test)   373.0379   380.9457  1286.687  1258.165  2062.388  2417.733    10  a
       zoo_roll(test) 38731.0369 39457.2607 40566.126 40940.586 41114.990 42207.149    10  d
      dt_reduce(test)  1887.9322  1916.8769  2128.567  2043.301  2218.635  2698.438    10  a

Now the RcppRoll solution is by far the fastest, followed by the data.table solution.
