Does the order of keys in data.table matter?

Question

I have a data.table. It has two key columns: Year (10 levels) and MemberID (200,000 levels). When I set the key, does setkey(MemberID, Year) give different performance from setkey(Year, MemberID)? If so, which one is better?

r data.table

AdamNyc Dec 04 '12 at 12:49

1 answer

mnel · Accepted Answer · 2012-12-04T01:22:43+0000

The time taken to set the key depends on the types of the key columns. numeric columns are slower than integer columns. character columns (when the strings are short) appear to be fast.

For example:

    library(data.table)
    set.seed(1)

    # integer year, character id
    DIC <- data.table(year = sample(seq_len(10), 5e6, TRUE),
                      id = sample(as.character(seq_len(2e5)), 5e6, TRUE),
                      z = rnorm(5e6))
    DIC2 <- copy(DIC)

    # integer year, factor id
    DIF <- data.table(year = sample(seq_len(10), 5e6, TRUE),
                      id = sample(as.factor(seq_len(2e5)), 5e6, TRUE),
                      z = rnorm(5e6))
    DIF2 <- copy(DIF)

    # numeric year, character id
    DNC <- data.table(year = sample(as.numeric(seq_len(10)), 5e6, TRUE),
                      id = sample(as.character(seq_len(2e5)), 5e6, TRUE),
                      z = rnorm(5e6))
    DNC2 <- copy(DNC)

    # character year, character id
    DCC <- data.table(year = sample(as.character(seq_len(10)), 5e6, TRUE),
                      id = sample(as.character(seq_len(2e5)), 5e6, TRUE),
                      z = rnorm(5e6))
    DCC2 <- copy(DCC)

    # integer year, integer id
    DII <- data.table(year = sample(seq_len(10), 5e6, TRUE),
                      id = sample(seq_len(2e5), 5e6, TRUE),
                      z = rnorm(5e6))
    DII2 <- copy(DII)

Some timings

    # key of integer, character columns
    system.time(setkey(DIC, year, id))
       user  system elapsed
       3.21    0.11    3.31
    system.time(setkey(DIC2, id, year))
       user  system elapsed
       3.43    0.03    3.45

    # key of integer, factor columns
    system.time(setkey(DIF, year, id))
       user  system elapsed
       6.31    0.05    6.37
    system.time(setkey(DIF2, id, year))
       user  system elapsed
       6.44    0.06    6.54

    # key of numeric, character columns
    system.time(setkey(DNC, year, id))
       user  system elapsed
       9.91    0.07   10.29
    system.time(setkey(DNC2, id, year))
       user  system elapsed
      10.11    0.07   10.34

    # key of two character columns
    system.time(setkey(DCC, year, id))
       user  system elapsed
       3.34    0.05    3.40
    system.time(setkey(DCC2, id, year))
       user  system elapsed
       3.40    0.02    3.42

    # key of two integer columns
    system.time(setkey(DII, year, id))
       user  system elapsed
       6.25    0.02    6.53
    system.time(setkey(DII2, id, year))
       user  system elapsed
       6.44    0.05    6.64

Which will be better? That is likely to depend on which column you subset by more often.

For example, you may need to get all the data for a single year.

If you set the key as (year, id), you can use

 D[J(1)]

but if the key was set as (id, year), then you need

 D[J(unique(id),1), nomatch = 0]

which takes more typing and takes longer to run, because unique(id) must be computed first.
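To make the comparison concrete, here is a minimal sketch on a tiny made-up table (the names D1 and D2 and the values are illustrative only, not the benchmark data). It assumes a reasonably recent data.table and uses D2$id explicitly rather than relying on column scoping inside J():

```r
library(data.table)

# Tiny hypothetical table: 3 years, 4 ids per year
D1 <- data.table(year = rep(1:3, each = 4),
                 id   = rep(c(2L, 1L, 4L, 3L), times = 3),
                 z    = seq_len(12) / 10)
D2 <- copy(D1)

setkey(D1, year, id)   # year is the leading key column
setkey(D2, id, year)   # id is the leading key column

# Leading key column is year: a one-column lookup is enough
by_year_first <- D1[J(1L)]

# Leading key column is id: every id must be supplied alongside the year
by_id_first <- D2[J(unique(D2$id), 1L), nomatch = 0]

# Both should contain the same rows for year 1
nrow(by_year_first) == nrow(by_id_first)
all.equal(by_year_first$z, by_id_first$z)
```

The second form has to build the full vector of ids before the join can start, which is the extra work described above.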

There is a feature request, FR #1007, that considers adding secondary keys, but this has not yet been implemented. Currently a data.table has a single key, which can span more than one column.
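As a small illustration of one key spanning several columns, the sketch below uses a made-up four-row table (D and its values are hypothetical). key() reports the key columns in the order they were set, and a lookup can supply a value for each of them:

```r
library(data.table)

# Hypothetical four-row table
D <- data.table(year = c(2L, 1L, 2L, 1L),
                id   = c("b", "a", "a", "b"),
                z    = 1:4)

setkey(D, year, id)   # one key made of two columns

key(D)                # key columns, in key order

# Look up a single (year, id) combination using both key columns
hit <- D[J(2L, "a")]
hit
```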
