When joining two data.tables and using by= in the same expression, I get an error when I try to use a column from the inner data.table in j. I can split this into two separate expressions, but that means extra typing, and possibly a performance hit with large datasets.
For example:
    require(data.table)
    DT1 <- data.table(k1 = 1:2, k2 = c('a', 'a', 'a', 'b', 'b', 'c'), v1 = 1:6, key = 'k2')
    DT2 <- data.table(k1 = c('a', 'b', 'c'), w1 = 3^(1:3), key = 'k1')
    DT1[DT2, sum(v1*w1), by=k1]
With small datasets, the join-then-group approach is fine. However, for datasets with many columns, creating an intermediate result that contains all the columns of both data.tables is a significant burden (my actual data.tables are about 1-2 GB in size).
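For reference, the two-step join-then-group version I mean looks roughly like this. It is only a sketch on the same toy tables; the names `joined` and `res` are made up for illustration, and on my real data `joined` is the large intermediate result I would like to avoid.

```r
library(data.table)

# Same toy tables as above, repeated so this snippet runs on its own
DT1 <- data.table(k1 = 1:2, k2 = c('a', 'a', 'a', 'b', 'b', 'c'), v1 = 1:6, key = 'k2')
DT2 <- data.table(k1 = c('a', 'b', 'c'), w1 = 3^(1:3), key = 'k1')

# Step 1: keyed join of DT1 (key k2) with DT2 (key k1).
# The result carries every column of DT1 plus w1 from DT2.
joined <- DT1[DT2]

# Step 2: group the joined result by DT1's k1 and aggregate.
res <- joined[, sum(v1 * w1), by = k1]
print(res)
```

The same thing can be written as one chain, DT1[DT2][, sum(v1*w1), by=k1], but as far as I can tell the intermediate joined table is still built before the grouping runs.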
I could reduce the number of columns involved in the join:
    DT1[DT2[,.(k1, w1)]][,sum(v1*w1),by=k1]
but that eliminates one of the great values of data.table: not having to constantly spell out the relationship between data sets. It also requires me to remember to name a particular column in two different places every time I do a join.
Is there something obvious that I'm missing?