The proper / fastest way to reshape a data.table.

I have a data table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = c("A", "B"),
                 v = sample(1:100, 12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by groups in data.table:

out <- DT[, list(SUM = sum(v)), by = list(x, y)]
out
     x y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns rather than rows. I can accomplish this using reshape:

out <- reshape(out, direction = 'wide', idvar = 'x', timevar = 'y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there a way to combine these two operations into a single step, using data.table operations?

+65
r data.table
Aug 01 '11 at 17:27
4 answers

The data.table package implements faster melt/dcast functions (in C). It also has additional features, such as being able to melt and cast multiple columns. See the new Efficient reshaping using data.tables vignette on Github.

The melt/dcast functions for data.table are available from version 1.9.0, and the features include the following (a short sketch applying dcast() to the question's data follows the list):

  • There is no need to load the reshape2 package before casting. But if you want it loaded for other operations, load it before loading data.table.

  • dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().

  • melt:

    • able to melt columns of type "list".

    • gains variable.factor and value.factor arguments, which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows you to directly control the output type of the variable and value columns (as factors or not).

    • melt.data.table's na.rm = TRUE parameter is internally optimized to remove NAs directly during melting and is therefore much more efficient.

    • NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further by using patterns(). See the vignette or ?melt.

  • dcast:

    • accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.

    • allows the rowid() function to be used directly in the formula to generate an id column, which is sometimes required to uniquely identify rows. See ?dcast.

  • Old benchmarks:

    • melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
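
As a concrete illustration of the above (a minimal sketch, assuming a recent data.table where dcast() dispatches on data.tables, i.e. 1.9.6 or later), the question's aggregation and reshape collapse into a single dcast() call:

 # Sketch: aggregate and reshape in one step with data.table's dcast().
 library(data.table)
 set.seed(1234)
 DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                  y = c("A", "B"),
                  v = sample(1:100, 12))
 dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
 # With the v values shown in the question, this returns:
 #    x   A   B
 # 1: 1  72 123
 # 2: 2  84 119
 # 3: 3 162  96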

Reminder of the Cologne presentation slide 32 (December 2013): Why not just submit dcast to reshape2?

+71
Aug 02 '11 at 13:52

This feature is now implemented in data.table (from version 1.8.11 on), as can be seen from Zach's answer above.

I just saw this nice chunk of code from Arun here on SO, so I assumed there would be a data.table solution. Applied to this problem:

 library(data.table)
 set.seed(1234)
 DT <- data.table(x = rep(c(1, 2, 3), each = 1e6),
                  y = c("A", "B"),
                  v = sample(1:100, 12))
 out <- DT[, list(SUM = sum(v)), by = list(x, y)]

 # edit (mnel) to avoid setNames, which creates a copy
 # when calling `names<-` inside the function
 out[, as.list(setattr(SUM, 'names', y)), by = list(x)]

    x        A        B
 1: 1 26499966 28166677
 2: 2 26499978 28166673
 3: 3 26500056 28166650

This gives the same results as the DWin approach:

 tapply(DT$v, list(DT$x, DT$y), FUN = sum)
          A        B
 1 26499966 28166677
 2 26499978 28166673
 3 26500056 28166650

Also, it is fast:

 system.time({
   out <- DT[, list(SUM = sum(v)), by = list(x, y)]
   out[, as.list(setattr(SUM, 'names', y)), by = list(x)]
 })
 ##    user  system elapsed
 ##    0.64    0.05    0.70

 system.time(tapply(DT$v, list(DT$x, DT$y), FUN = sum))
 ##    user  system elapsed
 ##    7.23    0.16    7.39

UPDATE

This solution also works for unbalanced data sets (i.e. when some combinations do not exist), but then you first have to join the missing combinations into the data table:

 library(data.table)
 set.seed(1234)
 DT <- data.table(x = c(rep(c(1, 2, 3), each = 4), 3, 4),
                  y = c("A", "B"),
                  v = sample(1:100, 14))
 out <- DT[, list(SUM = sum(v)), by = list(x, y)]
 setkey(out, x, y)
 intDT <- expand.grid(unique(out[, x]), unique(out[, y]))
 setnames(intDT, c("x", "y"))
 out <- out[intDT]
 out[, as.list(setattr(SUM, 'names', y)), by = list(x)]
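
Since out is keyed on (x, y), the expand.grid step can arguably be replaced by a keyed cross join with CJ(), which is the same idea the Summary below uses; a minimal sketch:

 # Sketch using the same aggregated, keyed 'out' as above: CJ() enumerates
 # every (x, y) combination and the keyed join fills missing ones with NA.
 out <- DT[, list(SUM = sum(v)), by = list(x, y)]
 setkey(out, x, y)
 out[CJ(unique(x), unique(y))][, as.list(setattr(SUM, 'names', y)), by = list(x)]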



Summary

Combining the comments above, here is a one-line solution:

 DT[, sum(v), keyby = list(x, y)][
   CJ(unique(x), unique(y)), allow.cartesian = TRUE][
   , setNames(as.list(V1), paste(y)), by = x]

It is also easy to modify this to compute more than just the sum, for example:

 DT[, list(sum(v), mean(v)), keyby = list(x, y)][
   CJ(unique(x), unique(y)), allow.cartesian = TRUE][
   , setNames(as.list(c(V1, V2)),
              c(paste0(y, ".sum"), paste0(y, ".mean"))), by = x]

 #    x A.sum B.sum   A.mean B.mean
 # 1: 1    72   123 36.00000   61.5
 # 2: 2    84   119 42.00000   59.5
 # 3: 3   187    96 62.33333   48.0
 # 4: 4    NA    81       NA   81.0
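
On more recent data.table versions (1.9.6 or later), multiple aggregation functions can also be passed to dcast() directly, which may read more easily than the chained one-liner; a sketch (the default column names, e.g. v_sum_A, differ from the hand-built names above):

 # Sketch, assuming data.table >= 1.9.6: a list of fun.aggregate is allowed;
 # fill = NA makes absent (x, y) combinations appear as NA, as in the output above.
 dcast(DT, x ~ y, fun.aggregate = list(sum, mean), value.var = "v", fill = NA)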
+31
Mar 19 '13 at 23:25

data.table objects inherit from 'data.frame', so you can just use tapply:

 > tapply(DT$v, list(DT$x, DT$y), FUN = sum)
     A   B
 1  72 123
 2  84 119
 3 162  96
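
The result is a matrix with the x values as row names rather than a column. If a data.table shaped like the reshape output is wanted, a small follow-up sketch (assuming data.table is loaded):

 # Sketch: convert the tapply matrix back to a data.table, keeping the
 # row names (the x values) as a column; as.data.table names it "rn".
 m <- tapply(DT$v, list(DT$x, DT$y), FUN = sum)
 res <- as.data.table(m, keep.rownames = TRUE)
 setnames(res, "rn", "x")
 res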
+20
Aug 01 '11 at 17:31

You can use dcast from the reshape2 library. Here is the code:

 # DUMMY DATA
 library(data.table)
 mydf = data.table(
   x = rep(1:3, each = 4),
   y = rep(c('A', 'B'), times = 2),
   v = rpois(12, 30)
 )

 # USE RESHAPE2
 # (argument names per current reshape2: fun.aggregate and value.var)
 library(reshape2)
 dcast(mydf, x ~ y, fun.aggregate = sum, value.var = "v")

NOTE: the tapply solution will be much faster.
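
That claim is easy to check locally; a minimal sketch of the comparison (timings vary by machine and data size, so none are shown here):

 # Sketch: compare tapply with reshape2::dcast on a larger dummy table.
 library(reshape2)
 set.seed(1)
 big <- data.frame(x = rep(1:3, each = 1e6),
                   y = rep(c("A", "B"), times = 1.5e6),
                   v = rpois(3e6, 30))
 system.time(tapply(big$v, list(big$x, big$y), FUN = sum))
 system.time(reshape2::dcast(big, x ~ y, fun.aggregate = sum, value.var = "v"))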

+7
Aug 01 '11 at 17:35


