R data.table - group by column includes a list

Question

R data.table - group by column includes a list

I am trying to use a group using the data.table package in R.

start <- as.Date('2014-1-1') end <- as.Date('2014-1-6') time.span <- seq(start, end, "days") a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b')) date value group 1 2014-01-01 1 a 2 2014-01-02 2 a 3 2014-01-03 3 b 4 2014-01-04 4 b 5 2014-01-05 5 a 6 2014-01-06 6 b a[,mean(value),by=group] > group V1 1: a 2.6667 2: b 4.3333

This works great.

Since I work with dates, it may happen that a special date not only has one, but also two groups.

 a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b')) date value group 1 2014-01-01 1 a 2 2014-01-02 2 c("a", "b") 3 2014-01-03 3 b 4 2014-01-04 4 b 5 2014-01-05 5 a 6 2014-01-06 6 b a[,mean(value),by=group] > Error in `[.data.table`(a, , mean(value), by = group) : The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).

I would like the date of the group with both groups to be used to calculate the average of group a, as well as group b.

Expected results:

 mean a: 2.6667 mean b: 3.75

Is this possible with the data.table package?

Update

thanks to akrun my original problem is resolved. After "splitting" the data table and in my case to calculate different factors (depending on the groups), I need a data table in its "original" form with unique rows based on the date. My solution so far:

 a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b')) b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)] date value group 1 2014-01-01 1 a 2 2014-01-02 2 a 3 2014-01-02 2 b 4 2014-01-03 3 b 5 2014-01-04 4 b 6 2014-01-05 5 a 7 2014-01-06 6 b # creates new column with mean based on group b[,factor := mean(value), by=group] #creates new data.table c without duplicate rows (based on date) + if a row has group a & b it creates the product of their factors c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date] date value group factor 01/01/14 1 a 2.666666667 02/01/14 2 c("a", "b") 10 03/01/14 3 b 3.75 04/01/14 4 b 3.75 05/01/14 5 a 2.666666667 06/01/14 6 b 3.75

I think this is not an ideal way to do this, but it works. Any suggestions how I could make this better?

Alternative solution (really slow !!!):

 d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group] for(i in 1:NROW(a)){ y1 <- 1 for(j in a[i,group][[1]]){ y1 <- y1 * d[group==j, V1] } a[i, factor := y1] }

My quick solution:

 # split rows that more than one group b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)] # calculate mean of different groups b <- b[,factor := mean(value), by=group] # only keep date + factor columns b <- b[,.(date, factor)] # summarise rows by date b <- b[,lapply(.SD,prod), by=date] # add summarised factor column to initial data.table c <- merge(a,b,by='date')

Is there a chance to make it faster?

+4

r data.table

Randomdude Aug 22 '15 at 11:19

source share

1 answer

akrun · Accepted Answer · 2015-08-22T12:21:22+0000

One option is to group by sequence of rows, we unlist column list ('group'), paste list elements together ( toString(..) ), use cSplit from splitstackshape with direction='long' to convert it to " long "format, and then get the mean column of the" value ", using" grp "as a grouping variable.

 library(data.table) library(splitstackshape) a[, grp:= toString(unlist(group)), 1:nrow(a)] cSplit(a, 'grp', ', ', 'long')[, mean(value), grp] # grp V1 #1: a 2.666667 #2: b 3.750000

I just realized that another option using splitstackshape would be listCol_l , which is unlist a list in long form. Since the output is data.table , we can use the data.table methods to calculate mean . It’s much more compact to get mean .

  listCol_l(a, 'group')[, mean(value), group_ul] # group_ul V1 #1: a 2.666667 #2: b 3.750000

Or another option without using splitstackshape is to replicate the rows of the dataset using the length of the list element. lengths is a convenient wrapper for sapply(group, length) and much faster. Then we change the “group” column to unlist from the original “group” from the “a” dataset and get mean “value” grouped by “group”.

  a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)][, mean(value), by = group] # group V1 #1: a 2.666667 #2: b 3.750000

R data.table - group by column includes a list

More articles: