R is the average number of n observations from the most recent observed date

I have a data frame with a person identifier, date of observation and metric. It looks like this:

ID Date Metric a Day 1 9 a Day 2 8 a Day 3 9 a Day 4 8 a Day 5 7 a Day 6 6 a Day 7 5 a Day 8 4 a Day 9 3 a Day 10 3 b Day 1 6 b Day 2 7 b Day 3 6 b Day 4 7 b Day 5 8 b Day 6 9 b Day 7 9 b Day 8 9 

I would like to condense this in one line on the identifier and add variables (prev2, prev3, prev4, prev5) that calculate the average value of n observations since the last observation date (but not including the lastdate on average), For example, "prev2" is the average of the last two observations, and "prev3" is the average of the last three observations. Therefore, prev2 for ID A is the average of the 8th day and 9th day (3.5). prev3 for ID B is the average value of the day 5, 6, 7 (8.67). Ultimately looking back from the most recent / largest date and averaging a series of observations.

It should look something like this:

 ID lastDate metric_avg prev2 prev3 prev4 prev5 a Day 10 6.2 3.5 4 4.5 5 b Day 8 7.63 9 8.67 8.25 7.8 

I am trying to create predictor variables to analyze exhaustion in my company. The thought that 1 month or 2 months after leaving work, Johnny's performance indicators are changing in such a way that one could predict whether Jimmy would be looked after in the near future.

Any recommendations or ideas on how to analyze this data would be super sweet!

Thanks!

+4
source share
2 answers

I would use this with dplyr , tidyr and magrittr .

Data

 df <- data.frame(ID=c(rep("a", 10), rep("b", 8), rep("c", 3), "d"), Date=c(paste("Day", 1:10), paste("Day", 1:8), paste("Day", 11:13), "Day 8"), Metric=c(9, 8, 9, 8, 7, 6, 5, 4, 3, 3, 6, 7, 6, 7, 8, 9, 9, 9, 3, 1, 8, 10)) 

the code

 library(tidyr); library(dplyr); library(magrittr) df %<>% separate(Date, into=c("d1", "d2")) %>% arrange(ID, as.numeric(d2)) %>% group_by(ID) %>% mutate(last_Date=paste("Day", max(as.numeric(d2))), metric_Avg=mean(Metric), prev2=(lag(Metric)+lag(Metric, 2))/2, prev3=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3))/3, prev4=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3)+lag(Metric, 4))/4, prev5=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3)+lag(Metric, 4)+lag(Metric, 5))/5) %>% ungroup %>% filter(last_Date==paste(d1, d2)) %>% select(ID, last_Date, metric_Avg, prev2, prev3, prev4, prev5) df 

Exit

  ID last_Date metric_Avg prev2 prev3 prev4 prev5 1 a Day 10 6.200 3.5 4.000 4.50 5.0 2 b Day 8 7.625 9.0 8.667 8.25 7.8 3 c Day 13 4.000 2.0 NA NA NA 4 d Day 8 10.000 NA NA NA NA 

Note

If the Date column contains dates, use the lubridate package. first few lines of code:

 df$Date <- ymd(df$Date) # id the Date is of the form yyyy-mm-dd or yyyy/mm/dd df %<>% arrange(ID, Date) %>% group_by(ID) %>% mutate(last_Date= max(Date)... 
+1
source

"lapply" may be useful:

 ID <- unique(data$ID) rowNr <- lapply(ID,function(id){which(data$ID==id)}) lastDate <- lapply(rowNr,function(n){data$Date[rev(n)[1]]}) metricAvg <- lapply(rowNr,function(n){mean(data$Metric[n])}) prev2 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),3),2)])}) prev3 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),4),3)])}) prev4 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),5),4)])}) prev5 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),6),5)])}) output <- data.frame( ID = ID, last_Date = unlist(lastDate), metric_Avg = unlist(metricAvg), prev2 = unlist(prev2), prev3 = unlist(prev3), prev4 = unlist(prev4), prev5 = unlist(prev5) ) 

Output:

 > output ID last_Date metric_Avg prev2 prev3 prev4 prev5 1 a Day 10 6.200 3.5 4.000000 4.50 5.0 2 b Day 8 7.625 9.0 8.666667 8.25 7.8 

Another example shows what happens if there are not enough days to calculate "prev5", "prev4", ...:

 > data ID Date Metric 1 a Day 1 9 2 a Day 2 8 3 a Day 3 9 4 a Day 4 8 5 a Day 5 7 6 a Day 6 6 7 a Day 7 5 8 a Day 8 4 9 a Day 9 3 10 a Day 10 3 11 b Day 1 6 12 b Day 2 7 13 b Day 3 6 14 b Day 4 7 15 b Day 5 8 16 b Day 6 9 17 b Day 7 9 18 b Day 8 9 19 c Day 11 3 20 c Day 12 1 21 c Day 13 8 22 d Day 8 10 

Output:

 > output ID last_Date metric_Avg prev2 prev3 prev4 prev5 1 a Day 10 6.200 3.5 4.000000 4.50 5.0 2 b Day 8 7.625 9.0 8.666667 8.25 7.8 3 c Day 13 4.000 2.0 NA NA NA 4 d Day 8 10.000 NA NA NA NA > 

This lightweight basic R solution is even faster than its overloaded competitor:

 > system.time( + for ( i in 1:5000) + { + ID <- unique(data$ID) + .... [TRUNCATED] user system elapsed 28.28 0.01 28.47 > #----------------------------------------------------------------- > > library(tidyr); library(dplyr); library(magrittr) > system.time( + for ( i in 1:5000) + { + df <-data + + df %<>% separate(Date, into=c("d1", "d2")) %>% + arrange(ID, as.numeri .... [TRUNCATED] user system elapsed 46.56 0.05 46.87 > 
0
source

All Articles