R data.table - group by a column that contains a list

I am trying to group rows using the data.table package in R.

    library(data.table)

    start <- as.Date('2014-1-1')
    end <- as.Date('2014-1-6')
    time.span <- seq(start, end, "days")
    a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))
    #         date value group
    # 1 2014-01-01     1     a
    # 2 2014-01-02     2     a
    # 3 2014-01-03     3     b
    # 4 2014-01-04     4     b
    # 5 2014-01-05     5     a
    # 6 2014-01-06     6     b

    a[,mean(value),by=group]
    #    group     V1
    # 1:     a 2.6667
    # 2:     b 4.3333

This works great.

Since I work with dates, a particular date may belong to two groups rather than just one.

    a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
    #         date value       group
    # 1 2014-01-01     1           a
    # 2 2014-01-02     2 c("a", "b")
    # 3 2014-01-03     3           b
    # 4 2014-01-04     4           b
    # 5 2014-01-05     5           a
    # 6 2014-01-06     6           b

    a[,mean(value),by=group]
    # Error in `[.data.table`(a, , mean(value), by = group) :
    #   The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1).
    #   Each must be same length as rows in x or number of rows returned by i (6).

I would like a date that belongs to both groups to count towards the mean of group a as well as the mean of group b.

Expected results:

 mean a: 2.6667 mean b: 3.75 
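
As a quick sanity check of these expected values, here is a minimal base R sketch. The vectors `values_a` and `values_b` are hypothetical names, typed in by hand from the example table; the value 2 (from 2014-01-02) appears in both because that date belongs to both groups.

```r
# Values from the example table, listed by the groups each date belongs to.
# 2014-01-02 (value 2) is in both groups, so it appears in both vectors.
values_a <- c(1, 2, 5)     # dates 01-01, 01-02, 01-05
values_b <- c(2, 3, 4, 6)  # dates 01-02, 01-03, 01-04, 01-06

mean_a <- mean(values_a)   # 8 / 3
mean_b <- mean(values_b)   # 15 / 4

print(c(a = mean_a, b = mean_b))
```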

Is this possible with the data.table package?

Update

Thanks to akrun, my original problem is resolved. After "splitting" the data table to calculate a different factor for each group, I need a data table in its "original" form again, with one unique row per date. My solution so far:

    a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
    b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
    #         date value group
    # 1 2014-01-01     1     a
    # 2 2014-01-02     2     a
    # 3 2014-01-02     2     b
    # 4 2014-01-03     3     b
    # 5 2014-01-04     4     b
    # 6 2014-01-05     5     a
    # 7 2014-01-06     6     b

    # create a new column with the mean of each group
    b[,factor := mean(value), by=group]

    # create a new data.table c without duplicate rows (based on date);
    # if a row has groups a and b, take the product of their factors
    c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]
    #     date value       group      factor
    # 01/01/14     1           a 2.666666667
    # 02/01/14     2 c("a", "b")          10
    # 03/01/14     3           b        3.75
    # 04/01/14     4           b        3.75
    # 05/01/14     5           a 2.666666667
    # 06/01/14     6           b        3.75

I think this is not an ideal way to do this, but it works. Any suggestions on how I could make it better?

Alternative solution (really slow!):

    d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group]
    for(i in 1:NROW(a)){
      y1 <- 1
      for(j in a[i,group][[1]]){
        y1 <- y1 * d[group==j, V1]
      }
      a[i, factor := y1]
    }

My quick solution:

    # split rows that have more than one group
    b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
    # calculate the mean of each group
    b <- b[,factor := mean(value), by=group]
    # only keep the date and factor columns
    b <- b[,.(date, factor)]
    # summarise rows by date
    b <- b[,lapply(.SD,prod), by=date]
    # add the summarised factor column to the initial data.table
    c <- merge(a,b,by='date')

Is there a chance to make it faster?

1 answer

One option is to group by the sequence of rows: we unlist the list column ('group') and paste its elements together with toString(..), use cSplit from splitstackshape with direction='long' to convert the result to "long" format, and then get the mean of the "value" column, using "grp" as the grouping variable.

    library(data.table)
    library(splitstackshape)
    a[, grp:= toString(unlist(group)), 1:nrow(a)]
    cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
    #   grp       V1
    #1:   a 2.666667
    #2:   b 3.750000
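
To make the intermediate steps visible, here is a minimal sketch of the same idea using only data.table and base R. It is not how cSplit is implemented; `strsplit` stands in for the splitting step. It also assumes that each (date, value) pair identifies exactly one row, as in the example data. The names `long`, `res` and `mean_value` are illustrative.

```r
library(data.table)

time.span <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
a <- data.table(
  date  = time.span,
  value = c(1, 2, 3, 4, 5, 6),
  group = list("a", c("a", "b"), "b", "b", "a", "b")
)

# Collapse each list element to a single string, e.g. c("a", "b") -> "a, b".
a[, grp := vapply(group, toString, character(1))]

# Split the strings again and emit one row per group label
# (assumption: date + value uniquely identify a row).
long <- a[, .(grp = unlist(strsplit(grp, ", ", fixed = TRUE))),
          by = .(date, value)]

# Mean of 'value' per group label.
res <- long[, .(mean_value = mean(value)), by = grp]
print(res)
```

After the first step, the `grp` column is an ordinary character vector, which is why it can be split and grouped on, unlike the original list column.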

I just realized that another option from splitstackshape is listCol_l, which unlists a list column into long form. Since the output is a data.table, we can use data.table methods to calculate the mean. This is a much more compact way to get the mean.

    listCol_l(a, 'group')[, mean(value), group_ul]
    #   group_ul       V1
    #1:        a 2.666667
    #2:        b 3.750000

Another option, without splitstackshape, is to replicate the rows of the dataset according to the length of each list element. lengths is a convenient and much faster wrapper for sapply(group, length). We then replace the "group" column with the unlisted "group" column from the original dataset "a" and get the mean of "value" grouped by "group".

    a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)][, mean(value), by = group]
    #   group       V1
    #1:     a 2.666667
    #2:     b 3.750000
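
For the follow-up in the question (one factor per date, taken as the product of the group means), the same row-replication idea can be combined with joins. This is a rough sketch and has not been benchmarked against the versions in the question. The names `long`, `means` and `per_date` are illustrative, and it assumes `date` is unique per row in `a`.

```r
library(data.table)

time.span <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
a <- data.table(
  date  = time.span,
  value = c(1, 2, 3, 4, 5, 6),
  group = list("a", c("a", "b"), "b", "b", "a", "b")
)

# Long form: one row per (date, group) pair.
long <- a[rep(seq_len(nrow(a)), lengths(group))][, group := unlist(a$group)]

# Mean of 'value' per group, computed once.
means <- long[, .(factor = mean(value)), by = group]

# Attach each group's mean to the long rows, then multiply per date.
per_date <- means[long, on = "group"][, .(factor = prod(factor)), by = date]

# Update join: add the per-date factor to the original table by reference
# (assumption: 'date' is unique in 'a').
a[per_date, on = "date", factor := i.factor]
print(a)
```

Because the last step is an update join, `a` keeps its original list column and row order and only gains the `factor` column.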