I want to transform the data.table so that it includes the historical (cumulative summed) information for each variable. The variable No indicates the chronological order of measurements for each ID. Each measurement reveals additional information, and I want to aggregate everything known up to each timestamp No for each ID.
Let me show you an example:
For the following data.table:
df <- data.table(ID = c(1,1,1,2,2,2,2), No = c(1,2,3,1,2,3,4),
                 Variable = c('a','b','a','c','a','a','b'),
                 Value = c(2,1,3,3,2,1,5))
df
   ID No Variable Value
1:  1  1        a     2
2:  1  2        b     1
3:  1  3        a     3
4:  2  1        c     3
5:  2  2        a     2
6:  2  3        a     1
7:  2  4        b     5
I want to change it to this:
   ID No  a  b  c
1:  1  1  2 NA NA
2:  1  2  2  1 NA
3:  1  3  5  1 NA
4:  2  1 NA NA  3
5:  2  2  2 NA  3
6:  2  3  3 NA  3
7:  2  4  3  5  3
So: the total values of Value, per Variable, by (ID, No), cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which yields the non-cumulative version:
   ID No  a  b  c
1:  1  1  2 NA NA
2:  1  2 NA  1 NA
3:  1  3  3 NA NA
4:  2  1 NA NA  3
5:  2  2  2 NA NA
6:  2  3  1 NA NA
7:  2  4 NA  5 NA
Any ideas how to make this cumulative? The original data table contains more than 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the source file has about 40 different levels. In addition, the NAs are important: there are also Value entries of 0, which mean something different from NA.
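One possible approach that avoids any row duplication (a sketch, not code from the question; it assumes data.table >= 1.12.4 for nafill): take cumulative sums per (ID, Variable) first, cast to wide, then carry the last known value forward within each ID. Leading NAs survive (a variable stays NA until first observed), and genuine 0 values are preserved.

```r
library(data.table)

df <- data.table(ID = c(1,1,1,2,2,2,2), No = c(1,2,3,1,2,3,4),
                 Variable = c('a','b','a','c','a','a','b'),
                 Value = c(2,1,3,3,2,1,5))

d <- copy(df)
# running total of Value within each (ID, Variable), in chronological order
setorder(d, ID, No)
d[, Value := cumsum(Value), by = .(ID, Variable)]

# cast to wide: one column per Variable, still with gaps (NA) where
# a variable was not measured at that timestamp
wide <- dcast(d, ID + No ~ Variable, value.var = "Value")

# carry the last observed running total forward within each ID
cols <- setdiff(names(wide), c("ID", "No"))
wide[, (cols) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = cols]
wide
```

Because the heavy lifting is a grouped cumsum, a dcast, and a last-observation-carried-forward fill, this should scale to hundreds of thousands of rows without inflating the table first.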
POSSIBLE SOLUTION
Ok, so I found a working solution. It is far from efficient, as it enlarges the original table.
The idea is to replicate each row TotalNo - No + 1 times (the original plus one copy for every later timestamp), where TotalNo is the maximum No per ID. You can then use the original dcast call to retrieve the data. In code:
df[, TotalNo := .N, by = ID]
df2 <- df[rep(seq(nrow(df)), df$TotalNo - df$No + 1)]
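The duplication step alone does not finish the job: each copy still needs the later timestamp it stands for before dcast can aggregate. A hedged completion of the idea (newNo is my own name, not from the question; the aggregate must be sum so that repeated observations of the same variable add up):

```r
library(data.table)

df <- data.table(ID = c(1,1,1,2,2,2,2), No = c(1,2,3,1,2,3,4),
                 Variable = c('a','b','a','c','a','a','b'),
                 Value = c(2,1,3,3,2,1,5))

df[, TotalNo := .N, by = ID]
df2 <- df[rep(seq(nrow(df)), df$TotalNo - df$No + 1)]

# within each original row's block of copies, assign the timestamps
# No, No+1, ..., TotalNo at which that measurement is already known
df2[, newNo := No + seq_len(.N) - 1L, by = .(ID, No)]

# sum everything known per (ID, newNo, Variable); combinations never
# observed stay NA rather than defaulting to 0
res <- dcast(df2, ID + newNo ~ Variable, value.var = "Value",
             fun.aggregate = sum, fill = NA_real_)
res
```

This reproduces the desired cumulative table, at the cost of the enlarged intermediate df2.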
This is not very nice, because the duplication uses more memory. I think it can be optimized, but so far it works for my purposes. On the sample data it goes from 7 rows to 16; on the source file, from 241,670 rows to a whopping 978,331, about 4 times more.
SOLUTION Eddi's approach improved on my solution's computation time on the complete data set (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everyone!