Efficiently removing missing values from the beginning and end of multiple time series in one data frame

Using R, I'm trying to trim NA values from the beginning and end of each time series in a data frame that contains multiple time series. I achieved my goal using a for loop and the zoo package, but as expected, it is extremely inefficient on large data frames.

My data frame looks like this. It contains 3 columns, and each time series is identified by its unique id (in this case AAA, B and CCC).

    id   date        value
    AAA  2010/01/01  NA
    AAA  2010/02/01  34
    AAA  2010/03/01  35
    AAA  2010/04/01  30
    AAA  2010/05/01  NA
    AAA  2010/06/01  28
    B    2010/01/01  NA
    B    2010/02/01  0
    B    2010/03/01  1
    B    2010/04/01  2
    B    2010/05/01  3
    B    2010/06/01  NA
    B    2010/07/01  NA
    B    2010/07/01  NA
    CCC  2010/01/01  0
    CCC  2010/02/01  400
    CCC  2010/03/01  300
    CCC  2010/04/01  200
    CCC  2010/05/01  NA

I would like to know how to efficiently remove the NA values from the beginning and end of each time series (AAA, B and CCC here) while keeping any NA values in the middle of a series. The result should look like this:

    id   date        value
    AAA  2010/02/01  34
    AAA  2010/03/01  35
    AAA  2010/04/01  30
    AAA  2010/05/01  NA
    AAA  2010/06/01  28
    B    2010/02/01  0
    B    2010/03/01  1
    B    2010/04/01  2
    B    2010/05/01  3
    CCC  2010/01/01  0
    CCC  2010/02/01  400
    CCC  2010/03/01  300
    CCC  2010/04/01  200
2 answers

I would do it like this; it should be very fast:

    require(data.table)
    DT = as.data.table(your_data)   # your_data is a placeholder; please provide something pastable
    DT2 = DT[!is.na(value)]         # only the non-NA observations
    setkey(DT,id,date)
    setkey(DT2,id,date)
    tokeep = DT2[DT,!is.na(value),rolltolast=TRUE,mult="last"]
    DT = DT[tokeep]

This works by rolling the non-NA observations forward, but not past the last one, within each group.
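As a rough illustration of that rule without data.table, here is a base R sketch. It is not the rolling join above but a swapped-in approach using `ave()` and `cumsum()`: a row is kept only if its group has a non-NA value at or before it and at or after it. The function name `trim_na_ends` and the small data frame `df` are made up for this example, and the rows are assumed to be sorted by date within each id.

```r
# Sketch: trim leading/trailing NA rows per id in base R.
# Assumes rows are already ordered by date within each id.
# trim_na_ends and df are illustrative names, not from the original post.
trim_na_ends <- function(df) {
  not_na <- as.integer(!is.na(df$value))
  # Running count of non-NA values from the start of each group
  seen_before <- ave(not_na, df$id, FUN = cumsum) > 0
  # Running count of non-NA values from the end of each group
  seen_after <- ave(not_na, df$id, FUN = function(v) rev(cumsum(rev(v)))) > 0
  # Keep rows that lie between the first and last non-NA of their group
  df[seen_before & seen_after, , drop = FALSE]
}

df <- data.frame(
  id    = c("AAA", "AAA", "AAA", "AAA", "B", "B", "B", "B"),
  date  = c("2010/01/01", "2010/02/01", "2010/03/01", "2010/04/01",
            "2010/01/01", "2010/02/01", "2010/03/01", "2010/04/01"),
  value = c(NA, 34, NA, 28, NA, 0, 1, NA),
  stringsAsFactors = FALSE
)
trimmed <- trim_na_ends(df)
```

The NA in the middle of AAA is kept because non-NA values exist on both sides of it. A group that contains only NA values is dropped entirely by this sketch.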

mult="last" is optional. It should speed things up if v1.8.0 (on CRAN) is used; I would be interested in timings with and without it. By default, data.table joins to groups (mult="all"), but in this case we are joining on all the columns of the key, and we know that the key is unique; that is, there are no duplicates in the key. In v1.8.1 (in development) you no longer need to know about this, and data.table takes care of it for you.
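Since the reasoning above relies on the key being unique, it may be worth checking that before the join. A minimal base R sketch, assuming the key columns are `id` and `date` as in the question; the small `keys` data frame is invented for illustration:

```r
# Hypothetical miniature of the question's data: only the key columns matter here
keys <- data.frame(
  id   = c("B", "B", "B", "CCC"),
  date = c("2010/06/01", "2010/07/01", "2010/07/01", "2010/01/01"),
  stringsAsFactors = FALSE
)

# Index of the first repeated (id, date) pair, or 0 if every pair is unique
first_dup <- anyDuplicated(keys[c("id", "date")])

# The rows that repeat an earlier (id, date) pair
dup_rows <- keys[duplicated(keys[c("id", "date")]), ]
```

If `first_dup` is not 0, the repeated rows would need to be resolved (or the duplicates accepted knowingly) before treating the key as unique.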


If your data is in a data frame named data:

    fun <- function(x) {
      # Run-length encode the NA mask so that x$value keeps its original type
      tmp <- rle(is.na(x$value))
      values <- tmp$values
      lengths <- tmp$lengths
      n <- length(values)
      nr <- nrow(x)
      id <- c()
      # Rows in the leading run of NAs
      if(values[1]) id <- c(id, 1:lengths[1])
      # Rows in the trailing run of NAs
      if(values[n]) id <- c(id, (nr-lengths[n]+1):nr)
      if(length(id) == 0) return(x)
      x[-id,]
    }
    # Apply per id and stack the pieces back together
    do.call(rbind, by(data, INDICES=data$id, FUN=fun))

Not the most elegant solution, I guess. It is in the same spirit as this post.


Source: https://habr.com/ru/post/1415073/
