TL;DR: With later versions of data.table that use auto-indexing, is there any benefit to using %chin% when subsetting a data.table on character columns?
In the past, using %chin% from data.table instead of %in% when subsetting on character vectors gave a significant speedup. In later versions of data.table, secondary indices are created automatically on non-key columns when subsetting. The creation and use of these indices seems to make any speed difference between %chin% and %in% irrelevant.
Going forward, are there any cases where using %chin% in a data.table subset will still improve speed, or can I just use %in% from now on?
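As a minimal sketch of the two behaviours discussed above (a toy table with made-up values, not the benchmark data from this post): on character vectors %chin% and %in% return the same result, and indices() can be used to check whether a subset left a secondary index behind. Whether an index actually appears depends on the data.table version and the two options set below.

```r
library(data.table)

# Toy data (hypothetical values, only for illustration)
DT <- data.table(x = c("Matt", "Arun", "Jan", "Matt"), y = 1:4)

# On character vectors, %chin% and %in% return the same logical vector
same_result <- identical(DT[["x"]] %chin% "Matt", DT[["x"]] %in% "Matt")
print(same_result)

# With auto-indexing enabled, subsetting on a non-key column may leave a
# secondary index behind; indices() lists any that exist on the table
options(datatable.auto.index = TRUE, datatable.use.index = TRUE)
n_in   <- DT[x %in% "Matt", .N]
n_chin <- DT[x %chin% "Matt", .N]
print(indices(DT))
```

Both subsets count the same rows either way; only the route taken to find them differs.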
Update: The discussion on PR #2494 (Better subset optimization for compound queries) seems to support the understanding that, when evaluated inside a data.table call, the way %chin% is executed has fundamentally changed.
In cases where a column is used to subset a table more than once, automatic indexing increases performance significantly. When the column is used only once (so the time taken to build the index is not recouped), having automatic indexing enabled sometimes gives slightly faster results.
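library(data.table)
library(microbenchmark)
set.seed(1234)
## 1e6 four-letter strings (many distinct values)
DiverseSize <- 1e6
Diverse <- paste0(sample(LETTERS, DiverseSize, replace = TRUE),
                  sample(letters, DiverseSize, replace = TRUE),
                  sample(letters, DiverseSize, replace = TRUE),
                  sample(letters, DiverseSize, replace = TRUE))
## 1e7 single-letter strings (only 26 distinct values)
CommonSize <- 1e7
Common <- sample(LETTERS, CommonSize, replace = TRUE)
## Shuffle both sets together into one character column
DT1 <- data.table(x = sample(c(Diverse, Common), size = CommonSize + DiverseSize, replace = FALSE))
## Independent copy, so each operator gets its own table (and its own indices)
DT2 <- copy(DT1)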
Comparing %in% and %chin% outside of a data.table call, %chin% is faster:
microbenchmark(
Outside_chin = length(which(DT1[["x"]] %chin% c("Matt"))),
Outside_in = length(which(DT2[["x"]] %in% c("Matt"))),
times = 1
)
...
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 Outside_chin 254.5967 254.5967 254.5967 254.5967 254.5967 254.5967     1
   Outside_in 476.2117 476.2117 476.2117 476.2117 476.2117 476.2117     1
Comparing %in% and %chin% inside a data.table call:
## Benchmarking -------
## Turn off Indices
options(datatable.auto.index = FALSE)
options(datatable.use.index = FALSE)
## Run without indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
## Run Again
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
## Turn indices back on
options(datatable.auto.index = TRUE)
options(datatable.use.index = TRUE)
## First run builds indices and takes longer
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
## Run again, benefiting from pre-built indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
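The calls above are shown without timing code. One way to time a first and a second run of each subset is sketched below. The helper `time_twice` and the small stand-in table are assumptions made for illustration and are not part of the original benchmark; the table is far smaller than DT1 and DT2, so absolute timings will not match.

```r
library(data.table)

set.seed(1)
# Small stand-in table (assumption: much smaller than DT1/DT2 above)
DT <- data.table(x = sample(c("Matt", LETTERS), 1e5, replace = TRUE))
DT_chin <- copy(DT)  # separate copies, so an index built by one
DT_in   <- copy(DT)  # operator cannot be reused by the other

# Hypothetical helper: elapsed seconds for two consecutive runs of f()
time_twice <- function(f) {
  c(first  = system.time(f())[["elapsed"]],
    second = system.time(f())[["elapsed"]])
}

options(datatable.auto.index = TRUE, datatable.use.index = TRUE)
timings <- rbind(
  chin_op = time_twice(function() DT_chin[x %chin% "Matt", .N]),
  in_op   = time_twice(function() DT_in[x %in% "Matt", .N])
)
print(timings)
```

Comparing the `first` and `second` columns gives a rough idea of how much of the first run is spent building an index. For stable numbers, repeated runs with microbenchmark on fresh copies would be more reliable than a single system.time call.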
ProfVis results: [profile output comparing the %chin% and %in% runs above; figure not preserved]

Version used: data.table 1.10.5, 2018-03-17 07:30:06 UTC.