Will using %chin% to subset on character columns of an auto-indexed data.table ever provide a speed-up?

Question


TL;DR: In later versions of data.table that use auto-indexing, is there any benefit to using %chin% when subsetting a data.table on character columns?

In the past, using %chin% from data.table instead of %in% when subsetting on character vectors led to significant speed-ups. In later versions of data.table, secondary indices are created automatically on non-key columns when they are used for subsetting. The creation and use of these indices seems to make any speed difference between %chin% and %in% irrelevant.
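As a minimal sketch of the auto-indexing behaviour described above, the following shows an index appearing on a non-key column after a single `%in%` subset. The table `DT` and its contents are hypothetical and are not part of the benchmark data below; `indices()` is data.table's accessor for the names of existing secondary indices.

```r
library(data.table)
set.seed(1)

# Hypothetical example table (an assumption, not the benchmark data)
DT <- data.table(x = sample(letters, 1e5, replace = TRUE),
                 y = runif(1e5))

# Make sure auto-indexing is enabled (these are the defaults)
options(datatable.auto.index = TRUE, datatable.use.index = TRUE)

before <- indices(DT)           # no secondary index yet
res    <- DT[x %in% c("a", "b")]  # subsetting on x triggers index creation
after  <- indices(DT)           # names of the indices now stored on DT

print(before)
print(after)
```

The index is stored as an attribute of `DT` itself, so later subsets on the same column can reuse it without rebuilding.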

Going forward, are there any cases where using %chin% to subset a data.table will still improve speed, or can I just use %in% from now on?

Update: The discussion on PR #2494 (Better subset optimization for compound queries) seems to support my understanding that, when evaluated within the data.table call environment, the way %chin% is executed has fundamentally changed.
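One way to inspect how each operator is handled inside a data.table call is the `verbose = TRUE` argument of `[.data.table`, which prints diagnostic messages about the subsetting strategy, including any index creation or reuse. This is only a sketch: the table `DT` is a hypothetical stand-in, and the exact wording of the messages depends on the installed data.table version.

```r
library(data.table)
set.seed(1)

# Hypothetical example table (an assumption, not the benchmark data)
DT <- data.table(x = sample(letters, 1e5, replace = TRUE))

# Same subset written with both operators; verbose = TRUE prints
# what data.table does internally for each call
n_in   <- DT[x %in%   c("a"), .N, verbose = TRUE]
n_chin <- DT[x %chin% c("a"), .N, verbose = TRUE]

# Both forms must select the same rows
print(c(n_in = n_in, n_chin = n_chin))
```

Comparing the messages printed for the two calls shows whether they take the same internal path on a given version.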

In cases where a column is used to subset the table more than once, auto-indexing improves performance significantly. When the column is used only once (so the time spent building the index is never recouped), turning auto-indexing off can sometimes give slightly faster results.

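Data used for the comparison:

10 million single-character strings, 26 possible unique values
1 million four-character strings, 456,976 possible unique values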

library(data.table)
library(microbenchmark)
set.seed(1234)

## Create a vector of 1 million 4 character strings
## with 456,976 possible unique values 
DiverseSize <- 1e6
Diverse <- paste0(sample(LETTERS,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE))

## Create a vector of 10 million single character strings
## with 26 possible unique values
CommonSize  <- 1e7
Common <-  sample(LETTERS,CommonSize,replace = TRUE)

## Mix them into a data.table column, "x"
DT1 <- data.table(x = sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE))
## Make a deep copy to run independent comparisons
DT2 <- copy(DT1)

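`%in%` vs. `%chin%` outside a data.table call

Outside a data.table call, %chin% is still faster than %in% in this single-run benchmark.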

microbenchmark(
  Outside_chin = length(which(DT1[["x"]] %chin% c("Matt"))),
  Outside_in   = length(which(DT2[["x"]] %in% c("Matt"))),
  times = 1
)

...

Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 Outside_chin 254.5967 254.5967 254.5967 254.5967 254.5967 254.5967     1
   Outside_in 476.2117 476.2117 476.2117 476.2117 476.2117 476.2117     1

`%in%` vs. `%chin%` inside a data.table call

## Benchmarking -------
## Turn off Indices
options(datatable.auto.index = FALSE)
options(datatable.use.index = FALSE)

## Run without indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

## Run Again
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

options(datatable.auto.index = TRUE)
options(datatable.use.index = TRUE)

## First run builds indices and takes longer
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

## Run again, benefiting from pre-built indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
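
The runs above print counts but not timings. A minimal sketch of how the same three stages (no indices, first indexed run, second indexed run) could be timed with base R's `system.time()` follows. The smaller tables `DT_in` and `DT_chin` and the helpers `time_in()` and `time_chin()` are hypothetical names introduced here so that the sketch runs quickly; they are not the original benchmark.

```r
library(data.table)
set.seed(1234)

# Hypothetical, smaller stand-ins for DT1 / DT2 (assumption for illustration)
DT_in   <- data.table(x = sample(c(LETTERS, "Matt"), 1e6, replace = TRUE))
DT_chin <- copy(DT_in)  # deep copy so each operator gets its own index state

# Hypothetical helpers: elapsed seconds for one subset-and-count
time_in   <- function(dt) system.time(dt[x %in%   "Matt", .N])[["elapsed"]]
time_chin <- function(dt) system.time(dt[x %chin% "Matt", .N])[["elapsed"]]

# Stage 1: indices switched off
options(datatable.auto.index = FALSE, datatable.use.index = FALSE)
no_index <- c(`in` = time_in(DT_in), chin = time_chin(DT_chin))

# Stage 2: indices switched on; the first run may include building an index
options(datatable.auto.index = TRUE, datatable.use.index = TRUE)
first_run <- c(`in` = time_in(DT_in), chin = time_chin(DT_chin))

# Stage 3: same query again, which can reuse an index if one was built
second_run <- c(`in` = time_in(DT_in), chin = time_chin(DT_chin))

timings <- rbind(no_index, first_run, second_run)
print(timings)
```

Single timings like these are noisy, so for a real comparison each stage would need to be repeated on fresh copies of the table.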