Move / Move Group

How to create a moving average with grouped data. Data here

set.seed(31) dd<-matrix(sample(seq(1:20),30,replace=TRUE),ncol=3) 

Add group id and sort by group id

 du<-sample(seq(1:4),10,replace=TRUE) d<-cbind(du,dd) d<-d[order(d[,1]),] 

This gives a moving average but ignores the bounderis group

 d_roll_mean <- apply(d[,2:4], 2, function(x) { rollapply(zoo(x), 3, mean, partial=TRUE, align='right') } ) 

Below are the results below.

 # cbind(d,d_roll_mean) # [1,] 1 3 3 12 3.000000 3.000000 12.000000 # [2,] 2 10 13 8 6.500000 8.000000 10.000000 # [3,] 2 17 2 17 10.000000 6.000000 12.333333 # [4,] 3 14 6 3 13.666667 7.000000 9.333333 # [5,] 3 6 20 1 12.333333 9.333333 7.000000 # [6,] 3 1 16 19 7.000000 14.000000 7.666667 # [7,] 3 19 2 11 8.666667 12.666667 10.333333 # [8,] 4 12 1 9 10.666667 6.333333 13.000000 # [9,] 4 10 13 12 13.666667 5.333333 10.666667 # [10,] 4 8 20 7 10.000000 11.333333 9.333333 

Here is the target sliding along the border of the group

 # Desired # [1,] 1 3 3 12 3.000000 3.000000 12.000000 # [2,] 2 10 13 8 10.000000 13.000000 8.000000 # [3,] 2 17 2 17 13.500000 7.500000 12.500000 # [4,] 3 14 6 3 14.000000 6.000000 3.000000 # [5,] 3 6 20 1 10.000000 13.000000 2.000000 # [6,] 3 1 16 19 7.000000 14.000000 7.666667 # [7,] 3 19 2 11 8.666667 12.666667 10.333333 # [8,] 4 12 1 9 12.000000 1.000000 9.000000 # [9,] 4 10 13 12 11.000000 7.000000 10.500000 # [10,] 4 8 20 7 10.000000 8.000000 9.333333 

This is close, but generates a list by coefficient instead of matrix

 doApply <- function(x) { apply(x, 2, function(y) { rollapply(zoo(y), 3, mean, partial=TRUE, align='right') }) } d2_roll_mean <- by(d[,2:4], d[,1], doApply) 

So there are some answers to the question, here is how they compare the runtime

 set.seed(31) nrow=20000 ncol=600 nun=350 nValues = 20 dd<-matrix(sample(seq(1:nValues),nrow*ncol,replace=TRUE),ncol=ncol) du<-sample(seq(1:nun),nrow,replace=TRUE) d<-cbind(du,dd) d<-d[order(d[,1]),] library(zoo) doApply <- function(x) { apply(x, 2, function(y) { rollapply(zoo(y), 3, mean, partial=TRUE, align='right') }) } library(data.table) library(caTools) fun1<-function(d) {by(d[,-1], d[,1], doApply)} fun2<- function(d){ DT <- data.table(d, key='du') DT[, lapply(.SD, function(y) runmean(y, 3, alg='fast',align='right')), by=du] } system.time(d2_roll_mean <- fun1(d)) system.time(d2_roll_mean2 <- fun2(d)) 

Timing shows using data tables about 10 times faster than rollapply.

  user system elapsed fun1 1048.910 0.378 1049.158 fun2 107.296 0.097 107.392 

I don't get equality, but when checked, they seem to be the same ...

 d2a<-do.call(rbind,d2_roll_mean) d2b<-cbind(1,d2a) d2c<-data.table(d2b) setnames(d2c,names(d2c),names(d2_roll_mean2)) all.equal(d2c,d2_roll_mean2) 

The output of all equal equals

 [1] "Attributes: < Length mismatch: comparison on first 1 components >" [2] "Component "du": Mean relative difference: 175.6631" 

When the above approach was applied to the data, the following error was received:

 Error in `[<-`(`*tmp*`, (k2 + 1):n, , value = 2) : subscript out of bounds 

This error was the result of some factors having too few rows. These lines were deleted and the process worked. Link: How to remove odds that have less than n elements

+7
r grouping moving-average
source share
2 answers

The only thing missing is do.call(rbind,d2_roll_mean) . Add raw data:

 cbind(d,do.call(rbind,d2_roll_mean)) 

EDIT: I passed this through system.time() for a larger example, and this takes its sweet time:

 set.seed(31) dd <- matrix(sample(seq(1:20),20000*500,replace=TRUE),ncol=500) du <- sample(seq(1:350),20000,replace=TRUE) d <- cbind(du,dd) d <- d[order(d[,1]),] system.time(d2_roll_mean <- by(d[,-1], d[,1], doApply)) User System elapsed 399.60 0.57 409.91 

by() and apply() are not the fastest functions. It might be faster to go through the columns with the for loop and do it with brute force, relying on the fact that d sorted by id.

+3
source share

Using data.table and caTools

 library(data.table) library(caTools) DT <- data.table(d, key='du') DT[, lapply(.SD, function(y) runmean(y, 3, alg='fast',align='right')), by=du] 

Update

If you want to create new columns in an existing dataset

  nm1 <- paste0('V', 2:4) nm2 <- paste0("V", 4:6) DT[, (nm1):=lapply(.SD, as.numeric), .SDcols=nm1][, (nm2):=lapply(.SD, function(y) runmean(y, 3, alg='fast', align='right')), by=du] 
+5
source share

All Articles