Why is it faster to calculate in `j` than with `$` in `data.table`?

Question

This may already have been answered and I missed it, but it is hard to search for.

A very simple question: why is `dt[, x]` generally a tiny bit faster than `dt$x`?

Example:

library(data.table)
library(microbenchmark)

dt <- data.table(id = 1:1e7, var = rnorm(1e6))

test <- microbenchmark(times = 100L,
                       dt[sample(1e7, size = 200000), var],
                       dt[sample(1e7, size = 200000), ]$var)

test[, "expr"] <- c("in j", "$")

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    $ 14.28863 15.88779 18.84229 17.23109 18.41577 53.63473   100
 in j 14.35916 15.97063 18.87265 17.99266 18.37939 54.19944   100

I may not have chosen the best example, so feel free to suggest something more pointed.

In any case, evaluating in `j` is faster at least 75% of the time (although it seems to have a fatter upper tail, since its mean is higher; side note: it would be nice if `microbenchmark` could spit out some histograms for me).

Why is this so?

r data.table

MichaelChirico Apr 29 '15 at 23:14

1 answer

mnel · Accepted Answer · 2015-04-29T23:41:29+0000

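With `j`, you are subsetting and selecting within a single call to `[.data.table`.

With `$` (and your call), you are subsetting within `[.data.table` and then selecting with `$`.

You are in essence calling 2 functions rather than 1, so there is a negligible difference in timing.

In your current example you are also calling `sample(1e7, 200000)` each time.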

dt <- data.table(id = 1:1e7, var = rnorm(1e6))
setkey(dt, id)
# Draw the row indices once so that every expression subsets the same rows
ii <- sample(1e7, size = 200000)

microbenchmark("in j" = dt[.(ii), var], "$" = dt[.(ii)]$var, "[[" = dt[.(ii)][["var"]],
               .subset2(dt[.(ii)], "var"), dt[.(ii)][[2]],
               dt[["var"]][ii], dt$var[ii], .subset2(dt, "var")[ii])
Unit: milliseconds
                       expr       min        lq      mean    median        uq       max neval cld
                       in j 39.491156 40.358669 41.570057 40.860342 41.485622 70.202441   100   b
                          $ 39.957211 40.561965 41.587420 41.136836 41.634584 69.928363   100   b
                         [[ 40.046558 40.515480 42.388432 41.244444 41.750946 72.224827   100   b
 .subset2(dt[.(ii)], "var") 39.772781 40.564077 41.561271 41.111630 41.635489 69.252222   100   b
             dt[.(ii)][[2]] 40.004300 40.513669 41.682526 40.927503 41.492866 72.986995   100   b
            dt[["var"]][ii]  4.432346  4.546898  4.946219  4.623416  4.755777 31.761115   100  a 
                 dt$var[ii]  4.440496  4.539502  4.668361  4.597457  4.729214  5.425125   100  a 
    .subset2(dt, "var")[ii]  4.365939  4.508261  4.660435  4.598815  4.703858  6.072289   100  a
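
As a sanity check on the comparison above, here is a minimal sketch showing that the three styles return the same values once the indices are fixed. The table size, the seed, and the names `in_j`, `dollar` and `col_first` are assumptions chosen for illustration (the table is much smaller than the one benchmarked above), and the sketch relies on `id` being `1:n`, so that a keyed join on `ii` and positional indexing by `ii` pick the same rows.

```r
library(data.table)

set.seed(1)                       # hypothetical seed, only for reproducibility
n  <- 1e5                         # assumed size, far smaller than the 1e7 rows above
dt <- data.table(id = 1:n, var = rnorm(n))
setkey(dt, id)
ii <- sample(n, size = 2000)      # indices drawn once, outside any timing

# One call: rows are subset and the column selected inside `[.data.table`
in_j <- dt[.(ii), var]

# Two calls: rows are subset by `[.data.table`, then `$` selects the column
dollar <- dt[.(ii)]$var

# Column first: extract the plain vector, then index it by position
col_first <- dt$var[ii]

identical(in_j, dollar)
identical(in_j, col_first)
```

Since the results match, the timings suggest that when only one column is needed, extracting it first and indexing the vector may be the cheaper route, because `[.data.table` is then not called at all.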
