na.locf: fill NAs up to maxgap even if the gap > maxgap, by group

I saw a solution to this ("Fill the NA in the time series with only a limited number"), but I can't make it work by group, and I suspect there is a neater way to do it.

Let's say I have the following dt:

    library(data.table)

    dt <- data.table(ID = c(rep("A", 10), rep("B", 10)),
                     Price = c(seq(1, 10, 1), seq(11, 20, 1)))
    dt[c(1:2, 5:10), 2] <- NA
    dt[c(11:13, 15:19), 2] <- NA
    dt
        ID Price
     1:  A    NA
     2:  A    NA
     3:  A     3
     4:  A     4
     5:  A    NA
     6:  A    NA
     7:  A    NA
     8:  A    NA
     9:  A    NA
    10:  A    NA
    11:  B    NA
    12:  B    NA
    13:  B    NA
    14:  B    14
    15:  B    NA
    16:  B    NA
    17:  B    NA
    18:  B    NA
    19:  B    NA
    20:  B    20

What I would like to do is fill the NAs forward and backward from the nearest non-NA value, but only up to two rows in either direction.

I also need this to be done by group (ID).

I tried na.locf / na.approx with maxgap = x etc., but they fill nothing at all when the gap between non-NA values is greater than maxgap. I want to fill forward and backward even when the gap between non-NA values is greater than maxgap, just by no more than two rows.

The end result should look something like this:

        ID Price Price_Fill
     1:  A    NA          3
     2:  A    NA          3
     3:  A     3          3
     4:  A     4          4
     5:  A    NA          4
     6:  A    NA          4
     7:  A    NA         NA
     8:  A    NA         NA
     9:  A    NA         NA
    10:  A    NA         NA
    11:  B    NA         NA
    12:  B    NA         14
    13:  B    NA         14
    14:  B    14         14
    15:  B    NA         14
    16:  B    NA         14
    17:  B    NA         NA
    18:  B    NA         20
    19:  B    NA         20
    20:  B    20         20

My actual dataset is massive, and I want to fill NAs forward and backward by up to 672 rows, but no more, within each group.

Thanks!

1 answer

For the example above, we group by 'ID' and apply shift to "Price" with n = 0:2 and type = "lead", which creates three temporary columns, then take their pmax. We pass that result to shift again with the same n (type defaults to "lag"), take the pmin, and assign it as "Price_Fill".

    dt[, Price_Fill := do.call(pmin,
           c(shift(do.call(pmax,
                           c(shift(Price, n = 0:2, type = "lead"), na.rm = TRUE)),
                   n = 0:2),
             na.rm = TRUE)),
       by = ID]
    dt
    #    ID Price Price_Fill
    # 1:  A    NA          3
    # 2:  A    NA          3
    # 3:  A     3          3
    # 4:  A     4          4
    # 5:  A    NA          4
    # 6:  A    NA          4
    # 7:  A    NA         NA
    # 8:  A    NA         NA
    # 9:  A    NA         NA
    #10:  A    NA         NA
    #11:  B    NA         NA
    #12:  B    NA         14
    #13:  B    NA         14
    #14:  B    14         14
    #15:  B    NA         14
    #16:  B    NA         14
    #17:  B    NA         NA
    #18:  B    NA         20
    #19:  B    NA         20
    #20:  B    20         20
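To see what each stage of that one-liner produces, the same steps can be unrolled on a plain vector. This is only an illustrative sketch: the names x, leads, back, lags and filled are made up here, and x is simply group A's "Price" from the question.

```r
library(data.table)

# Group A's Price values from the question
x <- c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)

# With a vector n, shift() returns a list: x itself, x led by 1, x led by 2
leads <- shift(x, n = 0:2, type = "lead")

# The row-wise max over those three vectors pulls values up to two rows backward
back <- do.call(pmax, c(leads, na.rm = TRUE))
back
# [1]  3  4  4  4 NA NA NA NA NA NA

# Lagging that result by 0:2 and taking the row-wise min pushes values
# up to two rows forward, and restores 3 in rows 2 and 3
lags <- shift(back, n = 0:2, type = "lag")
filled <- do.call(pmin, c(lags, na.rm = TRUE))
filled
# [1]  3  3  3  4  4  4 NA NA NA NA
```

Row 2 of back is 4 rather than 3 because pmax picks the larger of the two values in its window; the pmin stage corrects it here only because the prices are increasing.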

A more general approach is to apply pmin/pmax to the row index .I instead of to "Price", because "Price" will not always be an increasing sequence as it is in the OP's example.

    i1 <- dt[, do.call(pmin,
                c(shift(do.call(pmax,
                                c(shift(NA^(is.na(Price)) * .I, n = 0:2, type = "lead"),
                                  na.rm = TRUE)),
                        n = 0:2),
                  na.rm = TRUE)),
             ID]$V1
    dt$Price_Fill <- dt$Price[i1]
    dt$Price_Fill
    #[1]  3  3  3  4  4  4 NA NA NA NA NA 14 14 14 14 14 NA 20 20 20
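The NA^(is.na(Price)) * .I part may look cryptic. A small sketch on a made-up vector (x and mask are illustrative names, and seq_along(x) stands in for .I) shows what it does: in R, NA^0 is 1 and NA^1 is NA, so the expression keeps the row index where the value is present and blanks it where the value is missing.

```r
x <- c(NA, 3, NA, 4)

# is.na(x) is TRUE/FALSE, i.e. 1/0 as an exponent:
# NA^1 is NA (value missing), NA^0 is 1 (value present)
mask <- NA^is.na(x)
mask
# [1] NA  1 NA  1

# Multiplying by the row positions leaves only the positions of non-NA values
mask * seq_along(x)
# [1] NA  2 NA  4
```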

For example, if we change some "Price" values, the filled values change with them:

    dt$Price[3] <- 10
    dt$Price[14] <- 7
    dt$Price_Fill <- dt$Price[i1]
    dt$Price_Fill
    #[1] 10 10 10  4  4  4 NA NA NA NA NA  7  7  7  7  7 NA 20 20 20
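For the 672-row window mentioned in the question, shift with n = 0:672 would build 673 shifted copies of the column at each of the two stages, which may get heavy on a massive table. One possible alternative, sketched below in base R, tracks the position of the nearest non-NA value on each side with cummax/cummin and fills only when that position is within k rows. The helper name fill_within and its tie rule (prefer the earlier value when both sides are equally close) are assumptions made for this sketch, not part of the approach above.

```r
# Hypothetical helper: fill each NA from the nearest non-NA value that is
# at most k rows away; ties between the two sides go to the earlier value.
fill_within <- function(x, k) {
  n   <- length(x)
  idx <- seq_len(n)
  ok  <- !is.na(x)

  # Position of the last non-NA at or before each row (0 = none)
  prev <- cummax(ifelse(ok, idx, 0L))
  # Position of the next non-NA at or after each row (n + 1 = none)
  nxt  <- rev(cummin(rev(ifelse(ok, idx, n + 1L))))

  # Distance to each side, Inf when that side has no value
  d_prev <- ifelse(prev == 0L, Inf, idx - prev)
  d_next <- ifelse(nxt == n + 1L, Inf, nxt - idx)

  # Pick the nearer source position, then drop sources further than k rows
  src <- ifelse(d_prev <= d_next, prev, nxt)
  src[pmin(d_prev, d_next) > k] <- NA
  x[src]
}

# The two groups from the question, as plain vectors
a <- c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)
b <- c(NA, NA, NA, 14, NA, NA, NA, NA, NA, 20)
res_a <- fill_within(a, 2)  # 3 3 3 4 4 4 NA NA NA NA
res_b <- fill_within(b, 2)  # NA 14 14 14 14 14 NA 20 20 20

# Applied by group to the question's data.table (assumed usage):
# dt[, Price_Fill := fill_within(Price, 672), by = ID]
```

Because every non-NA value is its own nearest source at distance 0, existing values are never overwritten, and the cost per group is linear in its length regardless of k.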