Taking column-wise differences in a data.table

How can I use data.table syntax to produce a data.table in which each column contains the difference between a column of the original data.table and the next column?

Example: I have a data.table in which each row represents a group and each column gives the number surviving after year 0, year 1, year 2, etc. For instance,

    pop <- data.table(group_id = c(1, 2, 3),
                      N = c(4588L, 4589L, 4589L),
                      N_surv_1 = c(4213, 4243, 4264),
                      N_surv_2 = c(3703, 3766, 3820),
                      N_surv_3 = c(2953, 3054, 3159)
                      )
    #    group_id    N N_surv_1 N_surv_2 N_surv_3
    #           1 4588     4213     3703     2953
    #           2 4589     4243     3766     3054
    #           3 4589     4264     3820     3159

(Data types are different because N is a true integer, and N_surv_1, etc. are projections that can be fractional.)

What I tried: using base diff together with matrix transposition, I can do:

    diff <- data.table(t(diff(t(as.matrix(pop[,-1,with=FALSE])))))
    setnames(diff, paste0("deaths_",1:ncol(diff)))
    cbind(group_id = pop[,group_id],diff)
    # produces desired output:
    #    group_id deaths_1 deaths_2 deaths_3
    #           1     -375     -510     -750
    #           2     -346     -477     -712
    #           3     -325     -444     -661

I know that I can use base diff by group on the single value column created by melt.data.table, so this works, but it is not pretty:

    melt(pop,
         id.vars = "group_id"
         )[order(group_id)][, setNames(as.list(diff(value)),
                                       paste0("deaths_",1:(ncol(pop)-2)) ),
                            keyby = group_id]

Is this the most data.table-riffic way to do it, or is there a way to do this as a multi-column operation in data.table?

3 answers

Well, you could subtract subsets:

    ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
    pop[, Map(
      `-`,
      utils:::tail.default(.SD, -1),
      utils:::head.default(.SD, -1)
    ), .SDcols=ncols]
    #    N_surv_1 N_surv_2 N_surv_3
    # 1:     -375     -510     -750
    # 2:     -346     -477     -712
    # 3:     -325     -444     -661

You can assign these values to new columns with := . I have no idea why tail and head are not made more readily accessible... As @akrun pointed out, you can instead select the column subsets with with=FALSE .
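As a rough illustration of the last two remarks, here is a minimal sketch that selects the "later" and "earlier" column subsets with with=FALSE and then adds the differences by reference with := . The helper names value_cols, later, earlier, diffs and new_names, and the deaths_ prefix borrowed from the question, are assumptions made for this example and are not part of the answer above. It also sidesteps the unexported utils:::tail.default and utils:::head.default, whose behaviour on objects that have a dim attribute may differ between R versions.

```r
library(data.table)

pop <- data.table(
  group_id = c(1, 2, 3),
  N        = c(4588L, 4589L, 4589L),
  N_surv_1 = c(4213, 4243, 4264),
  N_surv_2 = c(3703, 3766, 3820),
  N_surv_3 = c(2953, 3054, 3159)
)

# Hypothetical helper names: every column except the id, in table order.
value_cols <- setdiff(names(pop), "group_id")
later      <- value_cols[-1]                   # N_surv_1, N_surv_2, N_surv_3
earlier    <- value_cols[-length(value_cols)]  # N, N_surv_1, N_surv_2

# Pair the columns by position and subtract each earlier column from the later one.
diffs <- Map(`-`, pop[, later, with = FALSE], pop[, earlier, with = FALSE])

# Add the results to pop by reference under new (assumed) column names.
new_names <- paste0("deaths_", seq_along(later))
pop[, (new_names) := diffs]
```

Because := modifies pop in place, no copy of the table is made; take copy(pop) first if the original must stay untouched.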

Anyway, this is pretty convoluted compared to a simple reshape:

    melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
    #    group_id   V1
    # 1:        1 -375
    # 2:        1 -510
    # 3:        1 -750
    # 4:        2 -346
    # 5:        2 -477
    # 6:        2 -712
    # 7:        3 -325
    # 8:        3 -444
    # 9:        3 -661

Without reshaping the data, and given that each row has a unique identifier, you can group by the id column and then compute the differences with diff on each row, i.e. on unlist(.SD) :

    pop[, setNames(as.list(diff(unlist(.SD))),
                   paste0("deaths_", 1:(ncol(pop)-2))),
        group_id]
    #    group_id deaths_1 deaths_2 deaths_3
    # 1:        1     -375     -510     -750
    # 2:        2     -346     -477     -712
    # 3:        3     -325     -444     -661

Essentially, it comes down to this if you leave out the column naming:

 pop[, as.list(diff(unlist(.SD))), group_id] 
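To make the per-row step concrete, the following minimal sketch does by hand what happens inside a single group: the one row is flattened into a plain numeric vector and diff is taken along it. The names value_cols, row1 and deaths_row1 are made up for this illustration. It also shows why the unique identifier matters: if a group held more than one row, unlist(.SD) would concatenate whole columns and diff would mix values from different rows.

```r
library(data.table)

pop <- data.table(
  group_id = c(1, 2, 3),
  N        = c(4588L, 4589L, 4589L),
  N_surv_1 = c(4213, 4243, 4264),
  N_surv_2 = c(3703, 3766, 3820),
  N_surv_3 = c(2953, 3054, 3159)
)

# Hypothetical helper: the non-id columns, i.e. what .SD holds when grouping by group_id.
value_cols <- setdiff(names(pop), "group_id")

# Flatten the first row into a numeric vector, in column order.
row1 <- unlist(pop[1, value_cols, with = FALSE])

# Successive differences along that vector: year-over-year changes for group 1.
deaths_row1 <- diff(row1)
print(deaths_row1)
```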

Here is another way to do this without reshaping or grouping, which could speed things up. With a small number of rows, the difference will probably not be noticeable.

    cols<-names(pop)[-1]
    combs<-list()
    for(i in 2:length(cols)) {
      combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
    }
    newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
    deathpop<-copy(pop)
    deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
    deathpop[,(cols):=NULL]

I ran some tests:

    rows<-10000000
    pop <- data.table(group_id = 1:rows,
                      N = runif(rows,3000,4000),
                      N_surv_1 = runif(rows,3000,4000),
                      N_surv_2 = runif(rows,3000,4000),
                      N_surv_3 = runif(rows,3000,4000))
    system.time({
      cols<-names(pop)[-1]
      combs<-list()
      for(i in 2:length(cols)) {
        combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
      }
      newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
      deathpop<-copy(pop)
      deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
      deathpop[,(cols):=NULL]})

which returned

       user  system elapsed
      0.192   0.808   1.003

By contrast, I ran

 system.time(pop[, as.list(diff(unlist(.SD))), group_id]) 

which returned

       user  system elapsed
    169.836   0.428 170.469

I also ran

    system.time({
      ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
      pop[, Map(
        `-`,
        utils:::tail.default(.SD, -1),
        utils:::head.default(.SD, -1)
      ), .SDcols=ncols]
    })

which returned

       user  system elapsed
      0.044   0.044   0.089

Finally, running

 system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]) 

returns

       user  system elapsed
    223.360   1.736 225.315

Frank's Map solution is the fastest. If you take the copy out of mine, it gets much closer to Frank's time, but his still wins in this case.
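For anyone who wants to repeat the comparison without waiting minutes, here is a scaled-down sketch that times a whole-column subtraction against the per-group diff and checks that both give the same numbers. The row count of 1e4, the seed, and the names value_cols, later, earlier, by_cols, by_rows, t_cols and t_rows are assumptions chosen for this example; the timings will differ from the figures above and depend on the machine and package versions.

```r
library(data.table)

set.seed(1)   # assumed seed, only so that reruns use the same data
rows <- 1e4   # assumed size, far smaller than the 10,000,000 rows used above
pop <- data.table(group_id = 1:rows,
                  N        = runif(rows, 3000, 4000),
                  N_surv_1 = runif(rows, 3000, 4000),
                  N_surv_2 = runif(rows, 3000, 4000),
                  N_surv_3 = runif(rows, 3000, 4000))

value_cols <- setdiff(names(pop), "group_id")
later      <- value_cols[-1]
earlier    <- value_cols[-length(value_cols)]

# Whole-column subtraction: one vectorised operation per pair of columns.
t_cols <- system.time(
  by_cols <- as.data.table(Map(`-`, pop[, later, with = FALSE],
                                    pop[, earlier, with = FALSE]))
)

# Per-group diff: one small computation for every row of the table.
t_rows <- system.time(
  by_rows <- pop[, as.list(diff(unlist(.SD))), by = group_id]
)

print(rbind(columns = t_cols, rows = t_rows)[, 1:3])
```

The per-group version does a separate unlist and diff for each of the groups, which is why its cost grows with the number of rows, whereas the column-wise version performs a fixed number of vectorised subtractions.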

