Efficient locf by groups in a single R data.table

I have a large, wide data.table (20 million rows) keyed by a person id, with many columns (~150) that contain lots of NA values. Each column is a recorded state / attribute that I want to carry forward for each person. Each person can have anywhere from 10 to 10,000 observations, and there are about 500,000 people in the set. Values from one person must not bleed into the next person, so my solution has to respect the person id column and group accordingly.

For demonstration purposes - here is a very small sample input:

    DT = data.table(
      id=c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
      aa=c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
      bb=c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
      cc=c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
    )

It looks like this:

        id aa bb cc
     1:  1  A NA  1
     2:  1 NA NA NA
     3:  1  B NA NA
     4:  1  C NA NA
     5:  2 NA NA NA
     6:  2 NA NA  4
     7:  2  D NA NA
     8:  2  E NA  5
     9:  3  F NA  6
    10:  3 NA NA NA
    11:  3 NA NA  7
    12:  3 NA NA NA

My expected result is as follows:

        id aa bb cc
     1:  1  A NA  1
     2:  1  A NA  1
     3:  1  B NA  1
     4:  1  C NA  1
     5:  2 NA NA NA
     6:  2 NA NA  4
     7:  2  D NA  4
     8:  2  E NA  5
     9:  3  F NA  6
    10:  3  F NA  6
    11:  3  F NA  7
    12:  3  F NA  7

I found a data.table solution that works (using na.locf from the zoo package), but it is very slow on my large datasets:

 DT[, na.locf(.SD, na.rm=FALSE), by=id] 

I found equivalent solutions using dplyr that are equally slow.

    GRP = DT %>% group_by(id)
    data.table(GRP %>% mutate_each(funs(blah=na.locf(., na.rm=FALSE))))

I was hoping I could come up with a "self-join" using the built-in data.table functionality, but I just can't get it right (I suspect I will need to use .N, but I haven't figured it out).

At this point, I think I will need to write something in Rcpp to get an efficient grouped locf.

I'm new to R, but I'm not new to C++, so I'm confident I can do this. I just feel that there should be an efficient way to do it in R using data.table.

r dataframe data.table dplyr rcpp
1 answer

A very simple na.locf can be built by carrying forward ( cummax ) the non- NA indices ( (!is.na(x)) * seq_along(x) ) and subsetting accordingly:

    x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
    x[cummax((!is.na(x)) * seq_along(x))]
    # [1] 1 1 1 6 4 5 4 4 4 2
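To see why this works, it can help to look at the intermediate vectors. The following is only an illustrative breakdown in base R; the variable names idx and filled_idx are mine and not part of any package.

```r
x <- c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)

# Position of each non-NA value, 0 where the value is NA
idx <- (!is.na(x)) * seq_along(x)
idx
# [1]  1  0  0  4  5  6  7  0  0 10

# cummax carries the last non-NA position forward over the zeros
filled_idx <- cummax(idx)
filled_idx
# [1]  1  1  1  4  5  6  7  7  7 10

# Indexing with those positions yields the filled vector
x[filled_idx]
# [1] 1 1 1 6 4 5 4 4 4 2

# A zero index selects nothing, so leading NAs are silently dropped
y <- c(NA, 5, NA)
y[cummax((!is.na(y)) * seq_along(y))]
# [1] 5 5
```

The last call returns a vector shorter than its input, which is why the leading-NA case needs separate handling.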

This replicates na.locf with the argument na.rm = TRUE (leading NA values are dropped). To get the na.rm = FALSE behavior, we just need to make sure that the first element passed to cummax is TRUE :

    x = c(NA, NA, 1, NA, 2)
    x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
    # [1] NA NA  1  1  2

In this case, we need to consider not only the non- NA indices, but also the indices at which the "id" column changes value (the column must already be ordered, or be ordered first, so that each id's rows are contiguous):

    id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
    c(TRUE, id[-1] != id[-length(id)])
    # [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

Combining the above:

    id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
    x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
    x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
    # [1]  1  1 NA  6  4  5  4  4 NA  2

Notice that here we OR the first element with TRUE , i.e. force it to TRUE , and thereby get the na.rm = FALSE behavior.
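The combined expression can be wrapped in a small helper so that it reads more easily. This is a minimal sketch in base R, not a function from any package: the name locf_by_id is hypothetical, and it assumes that rows sharing an id are contiguous and that id itself contains no NA values.

```r
# Hypothetical helper: carry the last non-NA value forward within each run of `id`.
# Assumes rows with the same id are contiguous and that `id` has no NAs.
locf_by_id <- function(x, id) {
  n <- length(x)
  if (n == 0L) return(x)
  # TRUE on the first row of every id run
  id_change <- c(TRUE, id[-1L] != id[-n])
  # Anchor positions: non-NA values and the first row of each group
  anchor <- (!is.na(x)) | id_change
  # Carry the latest anchor position forward and index with it
  x[cummax(anchor * seq_len(n))]
}

id <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
aa <- c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA)
locf_by_id(aa, id)
# [1] "A" "A" "B" "C" NA  NA  "D" "E" "F" "F" "F" "F"
```

Because a group's first row is always an anchor, a group that starts with NA keeps that NA instead of inheriting the previous group's last value.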

And for this example:

    id_change = DT[, c(TRUE, id[-1] != id[-.N])]
    DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
    #    id aa bb cc
    # 1:  1  A NA  1
    # 2:  1  A NA  1
    # 3:  1  B NA  1
    # 4:  1  C NA  1
    # 5:  2 NA NA NA
    # 6:  2 NA NA  4
    # 7:  2  D NA  4
    # 8:  2  E NA  5
    # 9:  3  F NA  6
    #10:  3  F NA  6
    #11:  3  F NA  7
    #12:  3  F NA  7
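On a table of this size it may also be worth overwriting the columns in place rather than building a new table. The following is a sketch under my own assumptions, not something I have benchmarked on the real data: it uses := with .SDcols to update only the non-id columns by reference, and it assumes DT is already ordered so that each id's rows are contiguous (run setorder(DT, id) or setkey(DT, id) first if not). The names fill_cols and row_idx are illustrative.

```r
library(data.table)

DT <- data.table(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa = c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb = rep(NA, 12),
  cc = c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

# TRUE on the first row of each id (rows of one id must be contiguous)
id_change <- DT[, c(TRUE, id[-1] != id[-.N])]
row_idx   <- seq_len(nrow(DT))

# Every column except the grouping column
fill_cols <- setdiff(names(DT), "id")

# Overwrite the columns by reference; no by= is needed because
# id_change already stops values from crossing id boundaries
DT[, (fill_cols) := lapply(.SD, function(x) {
  x[cummax(((!is.na(x)) | id_change) * row_idx)]
}), .SDcols = fill_cols]
```

After this call DT itself holds the filled values, and the id column is left untouched.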