I have a large, wide data.table (~20 million rows) with a person id and many columns (~150) that contain lots of NA values. Each column is a recorded state / attribute that I want to carry forward for each person. Each person has from 10 to 10,000 observations, and there are about 500,000 people. Values from one person must not bleed into the next person, so the solution has to respect the person id column and group accordingly.
For demonstration purposes - here is a very small sample input:
DT = data.table(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa = c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc = c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)
It looks like this:
    id aa bb cc
 1:  1  A NA  1
 2:  1 NA NA NA
 3:  1  B NA NA
 4:  1  C NA NA
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA NA
 8:  2  E NA  5
 9:  3  F NA  6
10:  3 NA NA NA
11:  3 NA NA  7
12:  3 NA NA NA
My expected result is as follows:
    id aa bb cc
 1:  1  A NA  1
 2:  1  A NA  1
 3:  1  B NA  1
 4:  1  C NA  1
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA  4
 8:  2  E NA  5
 9:  3  F NA  6
10:  3  F NA  6
11:  3  F NA  7
12:  3  F NA  7
I found a data.table solution using zoo::na.locf that works, but it is very slow on my large dataset:

library(zoo)
DT[, na.locf(.SD, na.rm = FALSE), by = id]
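For comparison, a much faster route (a sketch of my own, not from the post above) is to replace the per-group zoo call with a vectorized, index-based LOCF applied by reference inside `by`. The helper name `na_locf` is mine:

```r
library(data.table)

# Vectorized LOCF that preserves each column's type:
# for every position, compute the index of the most recent non-NA value.
na_locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # 0 until the first non-NA appears
  idx[idx == 0L] <- NA_integer_            # leading NAs stay NA
  x[idx]
}

cols <- setdiff(names(DT), "id")
DT[, (cols) := lapply(.SD, na_locf), .SDcols = cols, by = id]
```

Because `:=` updates the columns by reference and `na_locf` is a single vectorized pass per group, this avoids both the copy made by returning `.SD` and the overhead of calling zoo once per group.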
I found an equivalent solution using dplyr that is equally slow:

GRP = DT %>% group_by(id)
data.table(GRP %>% mutate_each(funs(na.locf(., na.rm = FALSE))))
I was hoping to roll my own solution using data.table functionality, but I just can't figure out the right approach (I suspect I need to use .N, but I haven't worked it out).
At this point, I think I will need to write something in Rcpp to do the grouped LOCF efficiently.

I'm new to R, but not new to C++, so I'm confident I can do it. I just feel there should be an efficient way to do this in R using data.table.
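If the pure-R route still proves too slow, the grouped LOCF is short in Rcpp. This is only a sketch under my own assumptions (function name and signature are mine, and it handles one numeric column at a time), assuming the table is already sorted by id:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// Carry the last non-NA value forward within runs of equal group ids.
// Assumes rows are ordered by grp; leading NAs in each group stay NA.
// [[Rcpp::export]]
NumericVector locf_by_group(NumericVector x, IntegerVector grp) {
  int n = x.size();
  NumericVector out = clone(x);  // leave the input column untouched
  for (int i = 1; i < n; ++i) {
    if (NumericVector::is_na(out[i]) && grp[i] == grp[i - 1])
      out[i] = out[i - 1];       // inherit the previous value in the same group
  }
  return out;
}
```

It could then be called per column by reference, e.g. `DT[, cc := locf_by_group(cc, id)]`, after sourcing the file with `Rcpp::sourceCpp()`.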