Efficiently compute a linear combination of data.table columns

I have nc columns in a data.table and nc scalars in a vector. I want to take a linear combination of the columns, but I don’t know in advance which columns I will use. What is the most efficient way to do this?

setup

    require(data.table)
    set.seed(1)

    n  <- 1e5
    nc <- 5
    cf <- setNames(rnorm(nc),LETTERS[1:nc])
    DT <- setnames(data.table(replicate(nc,rnorm(n))),LETTERS[1:nc])

ways to do it

Suppose I want to use the first four columns. I can write by hand:

 DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)] 

I can think of two automatic ways (which work without knowing in advance which of the columns A-E will be used):

    mycols <- LETTERS[1:4] # the first four columns

    DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols]
    DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
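Before timing anything, it may be worth confirming that the manual expression and the two automatic ones agree numerically. The following is a minimal sketch, assuming a smaller n than in the setup purely to keep the check quick; the `v_*` names are introduced here for illustration only.

```r
library(data.table)

set.seed(1)
n  <- 1e3   # assumption: smaller than the question's 1e5, just for a quick check
nc <- 5
cf <- setNames(rnorm(nc), LETTERS[1:nc])
DT <- setnames(data.table(replicate(nc, rnorm(n))), LETTERS[1:nc])
mycols <- LETTERS[1:4]

# The three expressions from the question, unchanged
manual <- DT[, list(cf['A']*A + cf['B']*B + cf['C']*C + cf['D']*D)]
coerce <- DT[, list(as.matrix(.SD) %*% cf[mycols]), .SDcols = mycols]
maprdc <- DT[, list(Reduce(`+`, Map(`*`, cf[mycols], .SD))), .SDcols = mycols]

# Compare plain numeric vectors so that differing attributes do not matter
v_manual <- as.numeric(manual[[1]])
v_coerce <- as.numeric(coerce[[1]])
v_maprdc <- as.numeric(maprdc[[1]])

stopifnot(isTRUE(all.equal(v_manual, v_coerce)),
          isTRUE(all.equal(v_manual, v_maprdc)))
```

`all.equal` compares with a numeric tolerance, which matters here because the three expressions may sum the terms in a different order.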

benchmarking

I expect as.matrix to make the matrix-multiplication option slow, and I have no intuition for the speed of Map-Reduce combinations.

    require(rbenchmark)
    options(datatable.verbose=FALSE) # in case you have it turned on

    benchmark(
      manual=DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
      coerce=DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols],
      maprdc=DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
    )[,1:6]

        test replications elapsed relative user.self sys.self
    2 coerce          100    2.47    1.342      1.95     0.51
    1 manual          100    1.84    1.000      1.53     0.31
    3 maprdc          100    2.40    1.304      1.62     0.75

When I repeat the benchmark call, the automatic approaches are anywhere from 5% to 40% slower than the manual one.

my application

The dimensions here, n and length(mycols), are close to what I'm working with, but I will run these computations many times, changing the coefficient vector cf each time.

2 answers

For me, this is almost twice as fast as your manual version:

    Reduce("+", lapply(names(DT), function(x) DT[[x]] * cf[x]))

    benchmark(manual = DT[, list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
              reduce = Reduce('+', lapply(names(DT), function(x) DT[[x]] * cf[x])))
    #    test replications elapsed relative user.self sys.self user.child sys.child
    #1 manual          100    1.43    1.744      1.08     0.36         NA        NA
    #2 reduce          100    0.82    1.000      0.58     0.24         NA        NA

To iterate over only mycols, replace names(DT) with mycols in the lapply call.
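To use this for an arbitrary subset of columns, the Reduce/lapply expression could be wrapped in a small function. This is only a sketch: `lincomb` and its arguments are hypothetical names, not part of data.table, and it assumes the coefficient vector is named after the columns, as `cf` is in the question.

```r
library(data.table)

# Hypothetical helper: linear combination of the columns named in `cols`,
# weighted by the matching entries of the named numeric vector `coefs`.
lincomb <- function(DT, coefs, cols = names(DT)) {
  stopifnot(all(cols %in% names(DT)), all(cols %in% names(coefs)))
  # `[[` extracts each column as a plain vector without copying the table
  Reduce(`+`, lapply(cols, function(x) DT[[x]] * coefs[[x]]))
}

set.seed(1)
n  <- 1e3   # assumption: smaller than the question's 1e5, to keep the example quick
nc <- 5
cf <- setNames(rnorm(nc), LETTERS[1:nc])
DT <- setnames(data.table(replicate(nc, rnorm(n))), LETTERS[1:nc])

mycols <- LETTERS[1:4]
res <- lincomb(DT, cf, mycols)
```

Using `coefs[[x]]` rather than `coefs[x]` drops the name from each coefficient, so the result is a plain unnamed numeric vector of length n.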


Add this option to your benchmark call:

 ops = as.matrix(DT) %*% cf 

On my machine, this is 30% faster than the matrix multiplication you tried.
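Since the question mentions repeating the calculation with many different coefficient vectors, one possible variation is to convert the chosen columns to a matrix once and multiply by all coefficient vectors in a single product. This is a sketch under assumptions not stated in the question: that the same columns are used every time and that the coefficient vectors are all available up front. `M`, `CF` and `n_rep` are illustrative names.

```r
library(data.table)

set.seed(1)
n  <- 1e4   # assumption: smaller than the question's 1e5, to keep the example quick
nc <- 5
DT <- setnames(data.table(replicate(nc, rnorm(n))), LETTERS[1:nc])
mycols <- LETTERS[1:4]

# Pay the as.matrix conversion cost once, outside any repetition
M <- as.matrix(DT[, mycols, with = FALSE])

# Hypothetical batch of coefficient vectors: one column per repetition
n_rep <- 10
CF <- matrix(rnorm(length(mycols) * n_rep),
             nrow = length(mycols),
             dimnames = list(mycols, NULL))

# Column k of `res` is the linear combination using the k-th coefficient vector
res <- M %*% CF
```

If the coefficient vectors only become known one at a time, `M %*% cf[mycols]` inside the loop would still reuse the single conversion. Whether either variant beats the Reduce approach is something to benchmark on your own data.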
