Cbind specific columns from multiple data.tables efficiently

Question

I have a list of data.tables that I need to cbind; however, I only need the last X columns of each.

My data is structured as follows:

    library(data.table)

    DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
    DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
    DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
    DT.list <- list(DT.1, DT.2, DT.3)

    > DT.list
    [[1]]
       x y v1 v2
    1: 1 a  1  3
    2: 1 a  2  4

    [[2]]
       x y v3
    1: 1 a  5
    2: 1 a  6

    [[3]]
       x y v4 v5 v6
    1: 1 a  7  9 11
    2: 1 a  8 10 12

The columns x and y are the same for each of the data.tables, but the number of columns is different. The output should not include repeating columns x and y. It should look like this:

       x y v1 v2 v3 v4 v5 v6
    1: 1 a  1  3  5  7  9 11
    2: 1 a  2  4  6  8 10 12

I want to avoid using a loop. I can bind the data.tables using do.call("cbind", DT.list) and then delete the duplicates manually, but is there a way to avoid creating the duplicates in the first place? Efficiency also matters, because the lists can be long and the data.tables large.
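For reference, the manual approach described above might look like the following minimal sketch. It assumes the shared columns carry identical names in every table; `combined` and `keep` are names made up for illustration, and only two of the three tables are used to keep it short.

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.list <- list(DT.1, DT.2)

# Bind everything first; x and y now appear once per table.
combined <- do.call("cbind", DT.list)

# Keep only the first occurrence of each column name. Columns are selected
# by position because the names are ambiguous at this point.
keep <- which(!duplicated(names(combined)))
combined <- combined[, keep, with = FALSE]
```

This builds the full table with the duplicated columns before discarding them, which is the extra work the question is trying to avoid.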

thanks

+5

r data.table cbind

graybag Jul 15 '15 at 12:49

3 answers

Here's another way:

    Reduce(
      function(x,y){
        newcols = setdiff(names(y),names(x))
        x[,(newcols)] <- y[,newcols,with=FALSE]
        x
      },
      DT.list,
      init = copy(DT.list[[1]][,c("x","y"),with=FALSE])
    )
    #    x y v1 v2 v3 v4 v5 v6
    # 1: 1 a  1  3  5  7  9 11
    # 2: 1 a  2  4  6  8 10 12

This avoids modifying the list (as @bgoldst's `<- NULL` approach does) or creating copies of each element of the list (as, I think, the lapply approach does). That said, I would probably just use the `<- NULL` approach in most practical applications.
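One way to check the claim that the input list is left alone is to run the Reduce() call and then inspect the original elements. A small sketch reusing the question's data; the name `res` is only for illustration.

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.3 <- data.table(x = c(1, 1), y = c("a", "a"), v4 = c(7, 8), v5 = c(9, 10), v6 = c(11, 12))
DT.list <- list(DT.1, DT.2, DT.3)

res <- Reduce(
  function(x, y) {
    # Only columns not yet present in the accumulated table are added.
    newcols <- setdiff(names(y), names(x))
    x[, (newcols)] <- y[, newcols, with = FALSE]
    x
  },
  DT.list,
  init = copy(DT.list[[1]][, c("x", "y"), with = FALSE])
)

names(res)           # column names of the combined table
names(DT.list[[2]])  # the list element still has its x and y columns
```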

+2

Frank Jul 15 '15 at 14:55

Another option is to use the `[` indexing function inside lapply over the list of data.tables to exclude the "unwanted" columns (in your case x and y). That way, duplicate columns are never created.

    # your given test data
    DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
    DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
    DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
    DT.list <- list(DT.1, DT.2, DT.3)

A) using a character vector indicating which columns to exclude

    # cbind a list of subsetted data.tables
    exclude.col <- c("x","y")
    myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
    myDT
    ##    v1 v2 v3 v4 v5 v6
    ## 1:  1  3  5  7  9 11
    ## 2:  2  4  6  8 10 12

    # join x & y columns for final results
    cbind(DT.list[[1]][,.(x,y)], myDT)
    ##    x y v1 v2 v3 v4 v5 v6
    ## 1: 1 a  1  3  5  7  9 11
    ## 2: 1 a  2  4  6  8 10 12

B) as above, but using the character vector directly in `lapply`

    myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
    myDT
    ##    v1 v2 v3 v4 v5 v6
    ## 1:  1  3  5  7  9 11
    ## 2:  2  4  6  8 10 12

    # join x & y columns for final results
    cbind(DT.list[[1]][,.(x,y)], myDT)
    ##    x y v1 v2 v3 v4 v5 v6
    ## 1: 1 a  1  3  5  7  9 11
    ## 2: 1 a  2  4  6  8 10 12

C) as above but all on one line

    do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
    # way too many brackets...but I think it works
    ##    x y v1 v2 v3 v4 v5 v6
    ## 1: 1 a  1  3  5  7  9 11
    ## 2: 1 a  2  4  6  8 10 12
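The empty argument after `[` in the lapply() calls above stands for the omitted row index i. For readers who find that hard to parse, the same idea can be sketched with an explicit anonymous function. This is an illustration rather than part of the original answer: `shared`, `key.part` and `value.parts` are made-up names, and it assumes the shared columns hold identical values in every table.

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.3 <- data.table(x = c(1, 1), y = c("a", "a"), v4 = c(7, 8), v5 = c(9, 10), v6 = c(11, 12))
DT.list <- list(DT.1, DT.2, DT.3)

shared <- c("x", "y")

# The shared columns are taken once, from the first table.
key.part <- DT.list[[1]][, shared, with = FALSE]

# Every table contributes only its non-shared columns.
value.parts <- lapply(DT.list, function(DT) {
  DT[, setdiff(names(DT), shared), with = FALSE]
})

result <- do.call(cbind, c(list(key.part), value.parts))
```

Selecting by name with setdiff() means the shared columns do not have to sit in the same position in every table.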

0

Valentin Sep 26 '17 at 10:43

bgoldst · Accepted Answer · 2015-07-15T12:57:51+0000

Here's how to do it in one shot, using lapply() to remove the x and y columns from the second and subsequent data.tables before calling cbind():

    do.call(cbind,c(DT.list[1],lapply(DT.list[2:length(DT.list)],`[`,j=-c(1,2),with=F)));
    ##    x y v1 v2 v3 v4 v5 v6
    ## 1: 1 a  1  3  5  7  9 11
    ## 2: 1 a  2  4  6  8 10 12
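One caveat with the one-liner: `2:length(DT.list)` evaluates to `c(2, 1)` when the list holds a single table, so the call no longer does what is intended in that case. A hedged variant that indexes with `DT.list[-1]` instead might look like this. `bind_keep_first` is a hypothetical helper name, and like the original it assumes x and y are the first two columns of every table.

```r
library(data.table)

bind_keep_first <- function(tables) {
  # tables[-1] is an empty list when there is only one table,
  # so lapply() simply has nothing to do in that case.
  rest <- lapply(tables[-1], function(DT) DT[, -c(1, 2), with = FALSE])
  do.call(cbind, c(tables[1], rest))
}

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.3 <- data.table(x = c(1, 1), y = c("a", "a"), v4 = c(7, 8), v5 = c(9, 10), v6 = c(11, 12))
DT.list <- list(DT.1, DT.2, DT.3)

all.tables <- bind_keep_first(DT.list)     # x, y, v1 ... v6
one.table  <- bind_keep_first(DT.list[1])  # just the columns of DT.1
```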

Another approach is to remove the x and y columns from the second and subsequent data.tables before executing a direct cbind(). I think there is nothing wrong with using a for loop for this:

    for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
    DT.list;
    ## [[1]]
    ##    x y v1 v2
    ## 1: 1 a  1  3
    ## 2: 1 a  2  4
    ##
    ## [[2]]
    ##    v3
    ## 1:  5
    ## 2:  6
    ##
    ## [[3]]
    ##    v4 v5 v6
    ## 1:  7  9 11
    ## 2:  8 10 12
    ##
    do.call(cbind,DT.list);
    ##    x y v1 v2 v3 v4 v5 v6
    ## 1: 1 a  1  3  5  7  9 11
    ## 2: 1 a  2  4  6  8 10 12
