The speed of setting a key depends on the types of the key columns. numeric columns will be slower than integer. character columns (when the strings are short) appear to be fast.
For example:
library(data.table)
set.seed(1)
DIC <- data.table(year = sample(seq_len(10), 5e6, TRUE),
                  id = sample(as.character(seq_len(2e5)), 5e6, TRUE),
                  z = rnorm(5e6))
DIC2 <- copy(DIC)
DIF <- data.table(year = sample(seq_len(10), 5e6, TRUE),
                  id = sample(as.factor(seq_len(2e5)), 5e6, TRUE),
                  z = rnorm(5e6))
DIF2 <- copy(DIF)
DNC <- data.table(year = sample(as.numeric(seq_len(10)), 5e6, TRUE),
                  id = sample(as.character(seq_len(2e5)), 5e6, TRUE),
                  z = rnorm(5e6))
DNC2 <- copy(DNC)
DCC <- data.table(year = sample(as.character(seq_len(10)), 5e6, TRUE),
                  id = sample(as.character(seq_len(2e5)), 5e6, TRUE),
                  z = rnorm(5e6))
DCC2 <- copy(DCC)
DII <- data.table(year = sample(seq_len(10), 5e6, TRUE),
                  id = sample(seq_len(2e5), 5e6, TRUE),
                  z = rnorm(5e6))
DII2 <- copy(DII)
Some timings
# key of integer, character columns
system.time(setkey(DIC, year, id))
   user  system elapsed
   3.21    0.11    3.31
system.time(setkey(DIC2, id, year))
   user  system elapsed
   3.43    0.03    3.45

# key of integer, factor columns
system.time(setkey(DIF, year, id))
   user  system elapsed
   6.31    0.05    6.37
system.time(setkey(DIF2, id, year))
   user  system elapsed
   6.44    0.06    6.54

# key of numeric, character columns
system.time(setkey(DNC, year, id))
   user  system elapsed
   9.91    0.07   10.29
system.time(setkey(DNC2, id, year))
   user  system elapsed
  10.11    0.07   10.34

# key of two character columns
system.time(setkey(DCC, year, id))
   user  system elapsed
   3.34    0.05    3.40
system.time(setkey(DCC2, id, year))
   user  system elapsed
   3.40    0.02    3.42

# key of two integer columns
system.time(setkey(DII, year, id))
   user  system elapsed
   6.25    0.02    6.53
system.time(setkey(DII2, id, year))
   user  system elapsed
   6.44    0.05    6.64
Which key order is better? That will probably depend on which column you are likely to subset by most often.
For example, you may want to get all the data for one year.
If you set the key as year, id, you can use
D[J(1)]
but if the key was set as id, year, then you need
D[J(unique(id),1), nomatch = 0]
which takes more typing and takes longer to run, because it must compute unique(id).
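The two subsetting forms can be compared side by side with a minimal sketch. The table names D_yi and D_iy and the small size n are assumptions made for illustration only; they are not part of the benchmark above.

```r
library(data.table)
set.seed(1)

# Small illustrative table; D_yi, D_iy and n are hypothetical names/sizes
n <- 1e4
D_yi <- data.table(year = sample(seq_len(10), n, TRUE),
                   id   = sample(as.character(seq_len(200)), n, TRUE),
                   z    = rnorm(n))
D_iy <- copy(D_yi)

setkey(D_yi, year, id)   # year is the leading key column
setkey(D_iy, id, year)   # id is the leading key column

# year leads the key: join on the year value directly
res_yi <- D_yi[J(1L)]

# id leads the key: every id has to be supplied alongside the year,
# and nomatch = 0 drops the (id, year) pairs that do not occur
res_iy <- D_iy[J(unique(id), 1L), nomatch = 0]

# Both results contain the same rows for year 1
nrow(res_yi) == nrow(res_iy)
```

Both calls should return the same set of rows, differing only in row order; wrapping each call in system.time() on the larger tables would show the extra cost of the second form.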
There is a feature request, FR #1007, for secondary keys, but this has not yet been implemented. Currently a data.table can have only one key, although that key can span more than one column.