Pivot a large data.table

Question

I have a large data.table in R:

    library(data.table)
    set.seed(1234)
    n <- 1e+07*2
    DT <- data.table(
      ID=sample(1:200000, n, replace=TRUE),
      Month=sample(1:12, n, replace=TRUE),
      Category=sample(1:1000, n, replace=TRUE),
      Qty=runif(n)*500,
      key=c('ID', 'Month')
    )
    dim(DT)

I would like to pivot this data.table so that each Category becomes a column. Unfortunately, since the number of categories is not constant within groups, I cannot use this answer.

Any ideas how I can do this?

Edit: Based on joran's comments and flodel's answer, what we are really reshaping is the following data.table:

 agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]

This reshape can be done in several ways (I have received some good answers so far), but what I am really looking for is something that scales well to a data.table with millions of rows and hundreds to thousands of categories.

+7

r data.table

Zach Apr 04 '13 at 10:07

4 answers

Like this?

    agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
    reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
            timevar = "Category", direction = "wide")
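To see what this call produces, here is a minimal sketch on a toy table. The values are made up for illustration, and a plain data.frame is used so the example depends only on base R. The group with ID 2 has no row for Category 1, so its Qty.1 cell comes back as NA.

```r
# Toy aggregated data (hypothetical values): ID 2 has no Category 1 row
agg <- data.frame(
  ID       = c(1, 1, 2),
  Month    = c(1, 1, 1),
  Category = c(1, 2, 2),
  Qty      = c(15, 7, 3)
)

# One output row per (ID, Month); one Qty.<Category> column per category
wide <- reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
                timevar = "Category", direction = "wide")

names(wide)   # "ID" "Month" "Qty.1" "Qty.2"
wide$Qty.1    # 15 NA  (the missing combination becomes NA)
```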

+3

flodel Apr 04 '13 at 22:25

There is no data.table-specific method for reshaping to wide format.

Here is an approach that will work, but it is rather convoluted.

There is a feature request, #2619 (scoping for the LHS in :=), that would help make this simpler.

Here is a simple example

    library(data.table)

    # a data.table
    DD <- data.table(a = letters[4:6], b = rep(letters[1:2], c(4, 2)), cc = as.double(1:6))
    # with not all categories represented
    DDD <- DD[1:5]
    # trying to make `a` columns containing `cc`, retaining `b` as a column
    # the unique values of `a` (you may want to sort this...)
    nn <- unique(DDD[, a])
    # create the correct wide data.table
    # with NA of the correct class in each created column
    rows <- max(DDD[, .N, by = list(a, b)][, N])
    DDw <- DDD[, setattr(replicate(length(nn), {
                   # safe version of correct NA
                   z <- cc[1]
                   is.na(z) <- 1
                   # using rows value calculated previously
                   # to ensure correct size
                   rep(z, rows)
                 }, simplify = FALSE), 'names', nn),
               keyby = list(b)]
    # set key for binary search
    setkey(DDD, b, a)
    # the possible values of the b column
    ub <- unique(DDw[, b])
    # nested loop doing things by reference, so should be
    # quick (the feature request would make this possible to
    # speed up using binary search joins)
    for(ii in ub){
      for(jj in nn){
        DDw[list(ii), {jj} := DDD[list(ii, jj)][['cc']]]
      }
    }

    DDw
    #    b  d e  f
    # 1: a  1 2  3
    # 2: a  4 2  3
    # 3: b NA 5 NA
    # 4: b NA 5 NA

+3

mnel Apr 04 '13 at 23:19

EDIT

I found this SO post, which includes a better way to insert the missing rows into a data.table. The function fun_DT has been adjusted accordingly. The code is cleaner now; I do not see any speed improvement, though.

See my update on the other post. Arun's solution works as well, but you have to insert the missing combinations manually. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution: first create ID2, then create all ID2-Category combinations, then fill up the data.table, and then reshape.

I am pretty sure this is not the best solution, but if this FR is implemented, these steps might be done automatically.
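The fill-and-reshape steps can be sketched on a toy table. This is only an illustration under stated assumptions: the values are made up, and stats::setNames (which copies) is swapped in for the by-reference setattr call used in fun_DT.

```r
library(data.table)

# Toy aggregated data (hypothetical values); ID2 is paste(ID, Month, sep = "_").
# Group "2_1" has no Category 1 row.
agg <- data.table(
  ID2      = c("1_1", "1_1", "2_1"),
  Category = c(1L, 2L, 2L),
  Qty      = c(15, 7, 3)
)
setkey(agg, ID2, Category)

# CJ() builds every ID2 x Category combination; joining on it adds the
# missing ("2_1", 1) row with Qty = NA
full <- agg[CJ(unique(ID2), unique(Category))]

# Every ID2 group now holds the same categories in the same order,
# so each group can be spread into a single wide row
wide <- full[, as.list(setNames(Qty, Category)), by = ID2]
wide
```

After the join every group has exactly one row per category, which is what makes the per-group as.list() step produce the same columns for each ID2.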

The solutions are roughly the same in speed, although it would be interesting to see how they scale (my machine is too slow, so I do not want to increase n any further ... the computer has already crashed often ;-)

    library(data.table)
    library(rbenchmark)

    fun_reshape <- function(n) {
      DT <- data.table(
        ID=sample(1:100, n, replace=TRUE),
        Month=sample(1:12, n, replace=TRUE),
        Category=sample(1:10, n, replace=TRUE),
        Qty=runif(n)*500,
        key=c('ID', 'Month')
      )
      agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
      reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
              timevar = "Category", direction = "wide")
    }

    #UPDATED!
    fun_DT <- function(n) {
      DT <- data.table(
        ID=sample(1:100, n, replace=TRUE),
        Month=sample(1:12, n, replace=TRUE),
        Category=sample(1:10, n, replace=TRUE),
        Qty=runif(n)*500,
        key=c('ID', 'Month')
      )
      agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
      agg[, ID2 := paste(ID, Month, sep="_")]
      setkey(agg, ID2, Category)
      agg <- agg[CJ(unique(ID2), unique(Category))]
      agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
    }

    n <- 1e+07
    benchmark(replications=10, fun_reshape(n), fun_DT(n))

                test replications elapsed relative user.self sys.self user.child sys.child
    2      fun_DT(n)           10  45.868        1    43.154    2.524          0         0
    1 fun_reshape(n)           10  45.874        1    42.783    2.896          0         0

+2

Christoph_J Apr 04 '13 at 23:22

Arun · Accepted Answer · 2014-03-13T14:58:06+0000

data.table implements faster versions of melt/dcast as data.table-specific methods (in C). It also adds features for melting and casting multiple columns. See the Efficient reshaping using data.tables vignette.

Note that we do not need to load the reshape2 package.

    library(data.table)
    set.seed(1234)
    n <- 1e+07*2
    DT <- data.table(
      ID=sample(1:200000, n, replace=TRUE),
      Month=sample(1:12, n, replace=TRUE),
      Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
      Qty=runif(n),
      key=c('ID', 'Month')
    )
    dim(DT)

    > system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
    #    user  system elapsed
    #  65.924  20.577  86.987
    > dim(ans)
    # [1] 2399401     802
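The same call can be tried on a toy table to see the shape of the result. This is a minimal sketch with made-up values; unlike the call above, it names the value column explicitly through value.var and spells out fun.aggregate and fill, which is an illustrative choice rather than part of the original answer.

```r
library(data.table)

# Toy data (hypothetical values): the (ID 2, Month 1) group has no Category 1 row
DT <- data.table(
  ID       = c(1, 1, 1, 2),
  Month    = c(1, 1, 1, 1),
  Category = c(1, 1, 2, 2),
  Qty      = c(10, 5, 7, 3)
)

# One row per (ID, Month), one column per Category; duplicate cells are
# summed, and cells with no rows are filled with 0
wide <- dcast(DT, ID + Month ~ Category,
              value.var = "Qty", fun.aggregate = sum, fill = 0)
wide
```

The two Category 1 rows for ID 1 are summed to 15, and the cell with no data gets 0 rather than NA.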
