How to keep combinations of variables that do not appear in the input when grouping with data.table?

Question

Using the data.table package, is it possible to summarize data while keeping combinations of variables that do not appear in the input?

With the plyr package, I know how to do this with the .drop argument, for example:

    require(plyr)
    df <- data.frame(categories = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                     groups = c(rep(c("X", "Y"), 4), "Z"),
                     values = rep(1, 9))
    df1 <- ddply(df, c("categories", "groups"), .drop = FALSE,
                 summarise, sum = sum(values))

Output:

      categories groups sum
    1          A      X   2
    2          A      Y   1
    3          A      Z   0
    4          B      X   1
    5          B      Y   2
    6          B      Z   0
    7          C      X   1
    8          C      Y   1
    9          C      Z   1

In this case, I keep every category/group combination, even those whose sum is 0.

r data.table

Davi Moreira Jan 23 '13 at 17:27

1 answer

Matt Dowle · Accepted Answer · 2013-01-23T18:16:48+0000

Great question. Here are two ways. They both use by-without-by.

    DT = as.data.table(df)
    setkey(DT, categories, groups)
    DT[CJ(unique(categories), unique(groups)), sum(values, na.rm=TRUE)]
       categories groups V1
    1:          A      X  2
    2:          A      Y  1
    3:          A      Z  0
    4:          B      X  1
    5:          B      Y  2
    6:          B      Z  0
    7:          C      X  1
    8:          C      Y  1
    9:          C      Z  1

where CJ stands for Cross Join, see ?CJ. by-without-by simply means that j is executed once for each group that each row of i joins to.

Admittedly, it looks complicated at first sight. The idea is that if you have a known subset of groups, this syntax is faster than grouping everything and then selecting only the results you need. In this case you want everything anyway, so there is little advantage apart from the ability to look up groups that do not exist in the data (which you cannot do with by).

Another way is to group with by first as usual, then join the result of CJ() to that:
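A note for readers on newer releases: as far as I know, since data.table 1.9.4 the by-without-by behaviour is no longer implicit and has to be requested with by = .EACHI. The following is a minimal sketch of the same cross-join lookup under that assumption; the column name total is my own choice, not part of the original answer.

```r
library(data.table)

# Same example data as in the question
df <- data.frame(categories = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 groups = c(rep(c("X", "Y"), 4), "Z"),
                 values = rep(1, 9))
DT <- as.data.table(df)
setkey(DT, categories, groups)

# by = .EACHI evaluates j once per row of the cross join,
# including rows that match nothing in DT (values is NA there,
# so sum(..., na.rm = TRUE) yields 0)
res <- DT[CJ(unique(categories), unique(groups)),
          .(total = sum(values, na.rm = TRUE)),
          by = .EACHI]
print(res)
```

Without by = .EACHI, recent versions evaluate j once over all joined rows and return a single sum.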

    DT[, sum(values), keyby='categories,groups'][CJ(unique(categories), unique(groups))]
       categories groups V1
    1:          A      X  2
    2:          A      Y  1
    3:          A      Z NA
    4:          B      X  1
    5:          B      Y  2
    6:          B      Z NA
    7:          C      X  1
    8:          C      Y  1
    9:          C      Z  1

but then you get NA instead of the desired 0. If necessary, those can be replaced using set(). The second way might be faster, because the two unique calls are given much smaller input.
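The NA replacement mentioned above could look roughly like this. It is a sketch only: the intermediate names agg and full and the column name total are assumptions introduced for illustration.

```r
library(data.table)

DT <- data.table(categories = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 groups = c(rep(c("X", "Y"), 4), "Z"),
                 values = rep(1, 9))

# Group first; keyby also sets the key used by the join below
agg <- DT[, .(total = sum(values)), keyby = .(categories, groups)]

# Join every category/group combination onto the aggregated table;
# combinations absent from the data come back with total = NA
full <- agg[CJ(unique(categories), unique(groups))]

# Overwrite the NA totals with 0 by reference
set(full, i = which(is.na(full$total)), j = "total", value = 0)
print(full)
```

Here set() modifies full in place, so no copy of the table is made.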

Both methods can be wrapped in small helper functions if you do this a lot.
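As one possible shape for such a helper, here is a sketch built on the second method. The function name sum_all_combos and its arguments are hypothetical, and it assumes exactly two grouping columns and a numeric (double) value column.

```r
library(data.table)

# Hypothetical helper: sums `value_col` for every combination of the
# two grouping columns `g1` and `g2`, filling absent combinations with 0
sum_all_combos <- function(dt, g1, g2, value_col) {
  # Aggregate the combinations that exist in the data
  agg <- dt[, .(total = sum(get(value_col))), keyby = c(g1, g2)]
  # Build the full grid of combinations and name its columns
  grid <- CJ(unique(dt[[g1]]), unique(dt[[g2]]))
  setnames(grid, c(g1, g2))
  # Join the grid onto the aggregate, then replace NA totals by reference
  out <- agg[grid, on = c(g1, g2)]
  set(out, i = which(is.na(out$total)), j = "total", value = 0)
  out[]
}

DT <- data.table(categories = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 groups = c(rep(c("X", "Y"), 4), "Z"),
                 values = rep(1, 9))
result <- sum_all_combos(DT, "categories", "groups", "values")
print(result)
```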
