Efficiently removing missing values from the beginning and end of several time series in one data frame

Using R, I'm trying to trim NA values from the beginning and end of each time series in a data frame that contains several of them. I achieved this with a for loop and the zoo package, but as expected, it is extremely inefficient on large data frames.

My data frame looks like this. It has 3 columns, and each time series is identified by a unique id: in this case AAA, B and CCC.

    id   date        value
    AAA  2010/01/01  NA
    AAA  2010/02/01  34
    AAA  2010/03/01  35
    AAA  2010/04/01  30
    AAA  2010/05/01  NA
    AAA  2010/06/01  28
    B    2010/01/01  NA
    B    2010/02/01  0
    B    2010/03/01  1
    B    2010/04/01  2
    B    2010/05/01  3
    B    2010/06/01  NA
    B    2010/07/01  NA
    B    2010/07/01  NA
    CCC  2010/01/01  0
    CCC  2010/02/01  400
    CCC  2010/03/01  300
    CCC  2010/04/01  200
    CCC  2010/05/01  NA

I would like to know how to efficiently remove the NA values from the beginning and end of each time series (here AAA, B and CCC), while keeping any NA values in the middle. The result should look like this:

    id   date        value
    AAA  2010/02/01  34
    AAA  2010/03/01  35
    AAA  2010/04/01  30
    AAA  2010/05/01  NA
    AAA  2010/06/01  28
    B    2010/02/01  0
    B    2010/03/01  1
    B    2010/04/01  2
    B    2010/05/01  3
    CCC  2010/01/01  0
    CCC  2010/02/01  400
    CCC  2010/03/01  300
    CCC  2010/04/01  200
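For anyone who wants to try the answers, the sample input can be rebuilt as a pastable data frame. This is only a sketch: the name `data` and storing `date` as a `Date` are assumptions, since the question does not say how the columns are typed. The B row for 2010/07/01 appears twice in the sample and is reproduced as is.

```r
# Rebuild the sample input from the question.
# Assumption: `date` is stored as a Date and the frame is called `data`.
data <- data.frame(
  id = rep(c("AAA", "B", "CCC"), times = c(6, 8, 5)),
  date = as.Date(c(
    # AAA
    "2010/01/01", "2010/02/01", "2010/03/01",
    "2010/04/01", "2010/05/01", "2010/06/01",
    # B (2010/07/01 is listed twice in the question)
    "2010/01/01", "2010/02/01", "2010/03/01", "2010/04/01",
    "2010/05/01", "2010/06/01", "2010/07/01", "2010/07/01",
    # CCC
    "2010/01/01", "2010/02/01", "2010/03/01",
    "2010/04/01", "2010/05/01"
  ), format = "%Y/%m/%d"),
  value = c(
    NA, 34, 35, 30, NA, 28,       # AAA
    NA, 0, 1, 2, 3, NA, NA, NA,   # B
    0, 400, 300, 200, NA          # CCC
  ),
  stringsAsFactors = FALSE
)

str(data)
```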
2 answers

I would do it like this; it should be very fast:

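    require(data.table)
    DT = as.data.table(your_data)   # placeholder: please provide something pastable
    DT2 = DT[!is.na(value)]         # only the non-NA rows
    setkey(DT, id, date)
    setkey(DT2, id, date)
    # Join every row of DT to the non-NA rows, rolling the previous non-NA row
    # forward within each id, but not past the last one
    tokeep = DT2[DT, !is.na(value), rolltolast=TRUE, mult="last"]
    DT = DT[tokeep]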

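This works by rolling the non-NA rows forward, but not past the last one, within each group. Rows before the first non-NA value or after the last one find no match and are dropped.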
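The same keep/drop rule can be sketched without a rolling join. The snippet below is not the data.table approach from this answer; it is a plain base R illustration of the rule (keep every row from the first non-NA value to the last non-NA value of its group), using the AAA and CCC series from the question. The helper name `inside_observed_span` is made up for the example.

```r
# Hypothetical helper: TRUE for positions between the first and the last
# non-NA value (inclusive), FALSE for leading and trailing NA runs.
inside_observed_span <- function(v) {
  seen <- !is.na(v)
  # At least one non-NA at or before this position AND at or after it
  cumsum(seen) > 0 & rev(cumsum(rev(seen))) > 0
}

# Series AAA and CCC from the question
data <- data.frame(
  id    = rep(c("AAA", "CCC"), times = c(6, 5)),
  value = c(NA, 34, 35, 30, NA, 28,
            0, 400, 300, 200, NA),
  stringsAsFactors = FALSE
)

# ave() applies the helper within each id and returns a vector aligned with
# the rows; the result comes back as 0/1, so convert it to logical.
keep <- as.logical(ave(data$value, data$id, FUN = inside_observed_span))
trimmed <- data[keep, ]
print(trimmed)
```

The NA in the middle of AAA is kept, because there are non-NA values on both sides of it.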

mult="last" is optional. It should speed things up if v1.8.0 (on CRAN) is used. I'd be interested in timings with and without it. By default, data.table joins to groups (mult="all"), but in this case we are joining on all the columns of the key, and we know that the key is unique; that is, there are no duplicates in the key. In v1.8.1 (in dev) you no longer need to know this; it takes care of it for you.


If your data is in a data frame called data:

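    fun <- function(x) {
      # Run-length encode the NA mask, so the value column keeps its type
      tmp <- rle(is.na(x$value))
      values <- tmp$values
      lengths <- tmp$lengths
      n <- length(values)
      nr <- nrow(x)
      id <- c()
      # Row indices of a leading run of NA, if there is one
      if (values[1]) id <- c(id, 1:lengths[1])
      # Row indices of a trailing run of NA, if there is one
      if (values[n]) id <- c(id, (nr - lengths[n] + 1):nr)
      if (length(id) == 0) return(x)
      x[-id, ]
    }
    # Apply per id and stack the trimmed pieces back together
    do.call(rbind, by(data, INDICES = data$id, FUN = fun))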

Not the most elegant solution, I guess. In the spirit of this post.
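To see what the run-length step contributes, here is a small sketch on a single series (AAA from the question). It only illustrates how `rle()` exposes the leading and trailing NA runs; the variable names are chosen for the example.

```r
# Series AAA from the question
v <- c(NA, 34, 35, 30, NA, 28)

# Run-length encode the NA mask: consecutive TRUE/FALSE values become runs
runs <- rle(is.na(v))
print(runs$values)   # whether each run consists of NA
print(runs$lengths)  # how long each run is

# Only the first and the last run matter for trimming
n <- length(runs$values)
keep <- rep(TRUE, length(v))
if (runs$values[1]) keep[seq_len(runs$lengths[1])] <- FALSE
if (runs$values[n]) keep[length(v) - seq_len(runs$lengths[n]) + 1] <- FALSE

trimmed <- v[keep]
print(trimmed)
```

Here the first run is a single NA and the last run is non-NA, so only the first element is dropped and the NA in the middle stays.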


Source: https://habr.com/ru/post/1415073/
