Rearrange data.table with cumulative totals

I want to reshape the data.table so that it includes the historical (cumulatively summed) information for each Variable. The column No gives the chronological order of measurements for each ID; every measurement reveals additional information. I want to aggregate everything known up to each timestamp No for every ID.

Let me show you an example:

For the following data.table:

    df <- data.table(ID       = c(1, 1, 1, 2, 2, 2, 2),
                     No       = c(1, 2, 3, 1, 2, 3, 4),
                     Variable = c('a', 'b', 'a', 'c', 'a', 'a', 'b'),
                     Value    = c(2, 1, 3, 3, 2, 1, 5))
    df
       ID No Variable Value
    1:  1  1        a     2
    2:  1  2        b     1
    3:  1  3        a     3
    4:  2  1        c     3
    5:  2  2        a     2
    6:  2  3        a     1
    7:  2  4        b     5

I want to change it to this:

       ID No  a  b  c
    1:  1  1  2 NA NA
    2:  1  2  2  1 NA
    3:  1  3  5  1 NA
    4:  2  1 NA NA  3
    5:  2  2  2 NA  3
    6:  2  3  3 NA  3
    7:  2  4  3  5  3

That is, I want the totals of Value per Variable, keyed by (ID, No) and cumulative over No.

I can get the result without the cumulative part by doing

    dcast(df, ID + No ~ Variable, value.var = "Value")

which gives this non-cumulative version:

       ID No  a  b  c
    1:  1  1  2 NA NA
    2:  1  2 NA  1 NA
    3:  1  3  3 NA NA
    4:  2  1 NA NA  3
    5:  2  2  2 NA NA
    6:  2  3  1 NA NA
    7:  2  4 NA  5 NA

Any ideas how to make this cumulative? The original data table contains more than 250,000 rows, so efficiency matters.

EDIT: I just used a, b, c as an example; the real file has about 40 different levels. Also, the NAs are important: there are Value entries of 0, which mean something different from NA.

POSSIBLE SOLUTION

OK, so I found a working solution. It is far from efficient, since it enlarges the original table.

The idea is to repeat each row TotalNo - No + 1 times (i.e., add TotalNo - No duplicates), where TotalNo is the maximum No per ID. A regular dcast call can then aggregate the data. In code:

    df[, TotalNo := .N, by = ID]

    # create the duplicates: repeat row i (TotalNo - No + 1) times
    df2 <- df[rep(seq(nrow(df)), df$TotalNo - df$No + 1)]

    # renumber No within each (ID, No) group of duplicates
    df3 <- df2[order(ID, No)]
    df3[, No := seq(from = No[1], to = TotalNo[1], by = 1), by = .(ID, No)]

    # the duplicates now make a plain dcast + sum produce cumulative totals
    df4 <- dcast(df3, formula = ID + No ~ Variable, value.var = "Value",
                 fill = NA, fun.aggregate = sum)

This is not very nice because the duplication uses extra memory. I think it can be optimized, but it works for my purposes so far. The sample data grows from 7 rows to 16; the real file grows from 241,670 rows to a whopping 978,331, about 4 times as many.
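For reference, the blow-up factor can be estimated before materialising the duplicates; a small sketch using the df and TotalNo column from above:

    # size of the duplicated table relative to the original
    df[, TotalNo := .N, by = ID]
    sum(df$TotalNo - df$No + 1) / nrow(df)
    # 2.29 on the sample data (16 / 7); about 4 on the full file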

SOLUTION

eddi's answer below improved the computation time on the full data set (2.08 seconds versus 4.36 seconds for my approach). Those are numbers I can work with. Thanks everyone!
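Timings like these are machine-dependent; a hedged sketch of how such a comparison can be reproduced, with via_duplication() and via_rolling_join() as hypothetical wrappers around the two full approaches (they are not defined in this post):

    # copy() keeps the by-reference := updates from leaking between runs
    system.time(via_duplication(copy(df)))    # ~4.36 s on the author's data
    system.time(via_rolling_join(copy(df)))   # ~2.08 s on the author's data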

3 answers

Your solution is fine, but it creates too many extra rows; they aren't needed if you compute the cumulative sum up front:

    # add useful columns: rows per ID, and running total per (ID, Variable)
    df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]

    # do a rolling join to extend the missing values, and then dcast
    dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
             on = c('ID', 'Variable', 'No'), roll = TRUE],
          ID + No ~ Variable, value.var = 'CumValue')
    #    ID No  a  b  c
    # 1:  1  1  2 NA NA
    # 2:  1  2  2  1 NA
    # 3:  1  3  5  1 NA
    # 4:  2  1 NA NA  3
    # 5:  2  2  2 NA  3
    # 6:  2  3  3 NA  3
    # 7:  2  4  3  5  3
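To see what the inner expression builds: on its own it produces one row per (ID, Variable) for every No from the first occurrence up to TotalNo, and the rolling join then carries each CumValue forward across that grid:

    df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)]
    #     ID Variable No
    #  1:  1        a  1
    #  2:  1        a  2
    #  3:  1        a  3
    #  4:  1        b  2
    #  5:  1        b  3
    #  6:  2        c  1
    #  7:  2        c  2
    #  8:  2        c  3
    #  9:  2        c  4
    # 10:  2        a  2
    # 11:  2        a  3
    # 12:  2        a  4
    # 13:  2        b  4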

Here's the standard way:

    library(zoo)

    df[, cv := cumsum(Value), by = .(ID, Variable)]
    DT = dcast(df, ID + No ~ Variable, value.var = "cv")

    # carry the last observation forward within each ID
    lvls = sort(unique(df$Variable))
    DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by = ID, .SDcols = lvls]
    DT
       ID No  a  b  c
    1:  1  1  2 NA NA
    2:  1  2  2  1 NA
    3:  1  3  5  1 NA
    4:  2  1 NA NA  3
    5:  2  2  2 NA  3
    6:  2  3  3 NA  3
    7:  2  4  3  5  3
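As a side note, recent data.table versions (1.12.4 and later) ship nafill(), which can replace the zoo dependency for the carry-forward step; a sketch assuming the cast columns are numeric, since nafill does not handle all column types:

    # same last-observation-carried-forward, without zoo
    DT[, (lvls) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = lvls]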

An alternative way to do this is with a custom cumulative-sum function. This is essentially the method from @DavidArenburg's comment, with the custom function substituted for plain cumsum.

EDIT: Now using @eddi's much more efficient version of the custom cumulative-sum function.

    # cumulative sum that keeps NA until the first real value appears
    cumsum.na <- function(z) {
      Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = TRUE),
             z, accumulate = TRUE)
    }

    cols <- sort(unique(df$Variable))
    res <- dcast(df, ID + No ~ Variable, value.var = "Value")[
      , (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
    res
       ID No  a  b  c
    1:  1  1  2 NA NA
    2:  1  2  2  1 NA
    3:  1  3  5  1 NA
    4:  2  1 NA NA  3
    5:  2  2  2 NA  3
    6:  2  3  3 NA  3
    7:  2  4  3  5  3
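To see how cumsum.na differs from plain cumsum on a vector with gaps, it keeps leading NAs but skips later ones, while cumsum propagates the first NA forever:

    cumsum.na(c(NA, 2, NA, 3))
    # [1] NA  2  2  5
    cumsum(c(NA, 2, NA, 3))
    # [1] NA NA NA NA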

This is definitely not the most efficient solution, but it does the job and gives you an admittedly very slow custom aggregation function that handles the NAs exactly the way you want.

