Create multiple lists based on multiple subsets of larger data.

I have data in R structured like the first block below, and I want to create a new data.frame from it with the following characteristics:

For each unique value of ID_1, I would like two new columns: one containing a list of the ID_2 values that share that ID_1 and have Direction == 1, and the other containing a list of the ID_2 values that share that ID_1 and have Direction == 0 (see the second block below).

Dataset block 1 (initial):

    ID_1    ID_2    Direction
    100001  1       1
    100001  11      1
    100001  111     1
    100001  1111    0
    100001  11111   0
    100001  111111  0
    100002  2       1
    100002  22      1
    100002  222     0
    100002  2222    0
    100003  3       1
    100003  33      1
    100003  333     1
    100003  3333    0
    100003  33333   0
    100003  333333  1
    100004  4       1
    100004  44      1

Converted to:

Dataset block 2 (desired result):

    ID_1    ID_2_D1          ID_2_D0
    100001  1,11,111         1111,11111,111111
    100002  2,22             222,2222
    100003  3,33,333,333333  3333,33333
    100004  4,44

I have code that does this (by looping over subsets of subsets), but I run it over many millions of unique "ID_1s", and it takes a very long time (hours!).

Any advice? Maybe something using apply() or the plyr package could make this run faster?


Code for reference:

    DF <- data.frame(
      ID_1 = c(100001, 100001, 100001, 100001, 100001, 100001,
               100002, 100002, 100002, 100002,
               100003, 100003, 100003, 100003, 100003, 100003,
               100004, 100004),
      ID_2 = c(1, 11, 111, 1111, 11111, 111111,
               2, 22, 222, 2222,
               3, 33, 333, 3333, 33333, 333333,
               4, 44),
      Direction = c(1, 1, 1, 0, 0, 0,
                    1, 1, 0, 0,
                    1, 1, 1, 0, 0, 1,
                    1, 1)
    )

My current (too slow) code is:

    DF2 <- data.frame(ID_1 = DF[!duplicated(DF$ID_1), ][, 1])
    for (i in 1:length(unique(DF2$ID_1))) {
      DF2$ID_2_D1[i] <- list(subset(DF, ID_1 == unique(DF2$ID_1)[i] & Direction == 1)$ID_2)
      DF2$ID_2_D0[i] <- list(subset(DF, ID_1 == unique(DF2$ID_1)[i] & Direction == 0)$ID_2)
    }
3 answers

Like this:

    library(reshape2)
    dcast(DF, ID_1 ~ Direction, value.var = "ID_2", list)
    #     ID_1                   0                  1
    # 1 100001 1111, 11111, 111111         1, 11, 111
    # 2 100002           222, 2222              2, 22
    # 3 100003         3333, 33333 3, 33, 333, 333333
    # 4 100004                                  4, 44

@flodel's answer is by far the easiest one I can think of, but here is an option in base R using aggregate and merge. It uses the "subset" argument in the aggregate step to get the individual columns for "Direction == 0" and "Direction == 1".

    temp1 <- aggregate(ID_2 ~ ., DF, as.vector, subset = c(Direction == 0))
    temp2 <- aggregate(ID_2 ~ ., DF, as.vector, subset = c(Direction == 1))
    merge(temp1[-2], temp2[-2], by = "ID_1", all = TRUE, suffixes = c("_0", "_1"))
    #     ID_1              ID_2_0             ID_2_1
    # 1 100001 1111, 11111, 111111         1, 11, 111
    # 2 100002           222, 2222              2, 22
    # 3 100003         3333, 33333 3, 33, 333, 333333
    # 4 100004                NULL              4, 44

A related approach (not sure if it will be faster) is to use split to create the subsets, lapply to aggregate each element of the resulting list, and Reduce to facilitate the merge:

    Reduce(
      function(x, y) merge(x, y, by = "ID_1", all = TRUE, suffixes = c("_0", "_1")),
      lapply(split(DF[1:2], DF$Direction),
             function(x) aggregate(ID_2 ~ ID_1, x, as.vector))
    )
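As a further base R sketch (not one of the approaches above), the merge step can be avoided by filtering once per Direction value and calling split on a factor that carries every ID_1 as a level, so IDs with no rows for a given Direction still get an empty entry. The names ids, id_factor, ids_for_direction and result are hypothetical, chosen only for illustration.

```r
DF <- data.frame(
  ID_1 = c(rep(100001, 6), rep(100002, 4), rep(100003, 6), rep(100004, 2)),
  ID_2 = c(1, 11, 111, 1111, 11111, 111111, 2, 22, 222, 2222,
           3, 33, 333, 3333, 33333, 333333, 4, 44),
  Direction = c(1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1)
)

ids <- unique(DF$ID_1)
# A factor with all IDs as levels keeps IDs that have no rows for a given Direction.
id_factor <- factor(DF$ID_1, levels = ids)

# Hypothetical helper: the ID_2 values of each ID_1 for one Direction value.
ids_for_direction <- function(d) {
  keep <- DF$Direction == d
  unname(split(DF$ID_2[keep], id_factor[keep]))
}

# Assemble the result with two list columns, one per Direction value.
result <- data.frame(ID_1 = ids)
result$ID_2_D1 <- ids_for_direction(1)
result$ID_2_D0 <- ids_for_direction(0)
str(result$ID_2_D0)
```

Each list column holds one numeric vector per row, so individual groups can be read back with, for example, result$ID_2_D1[[3]].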

And, of course, here is an approach that uses data.table, which you might want to consider since you mentioned needing to work on *many millions of unique "ID_1s"*. You are unlikely to see any benefit on this small example, but you should with the actual data.

    library(data.table)
    DT <- data.table(DF, key = "ID_1")
    DT0 <- DT[Direction == 0, list(D0 = list(ID_2)), by = key(DT)]
    DT1 <- DT[Direction == 1, list(D1 = list(ID_2)), by = key(DT)]
    DT0[DT1]
    #      ID_1                D0              D1
    # 1: 100001 1111,11111,111111        1,11,111
    # 2: 100002          222,2222            2,22
    # 3: 100003        3333,33333 3,33,333,333333
    # 4: 100004                              4,44

Update

As @Arun mentioned in the R public chat room, here is a simplified data.table approach that avoids creating two separate objects and merging them.

    # Name the outer list elements so the result columns are called D0 and D1
    DT[, list(D0 = list(ID_2[Direction == 0]),
              D1 = list(ID_2[Direction == 1])), by = ID_1]

You can use the apply family of functions here. I'm not sure this is what you need (it could probably be made faster still by subsetting only once), but I can't figure out how to do that right now. This achieves what you want:

    # Direction = 1: collect the ID_2 values for each ID_1, then collapse them to one string
    d1 <- lapply(unique(DF$ID_1), function(x) {
      subset(DF, ID_1 == x & Direction == 1)$ID_2
    })
    d1 <- sapply(d1, function(x) paste(x, collapse = ","))

    # Direction = 0
    d0 <- lapply(unique(DF$ID_1), function(x) {
      subset(DF, ID_1 == x & Direction == 0)$ID_2
    })
    d0 <- sapply(d0, function(x) paste(x, collapse = ","))

    # Results data frame
    resDF <- data.frame(ID_1 = unique(DF$ID_1), d1, d0)
    resDF
    #     ID_1              d1                d0
    # 1 100001        1,11,111 1111,11111,111111
    # 2 100002            2,22          222,2222
    # 3 100003 3,33,333,333333        3333,33333
    # 4 100004            4,44

I would be interested to know how fast this approach is.
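One rough way to find out, sketched here in base R only: time the lapply/subset step on synthetic data of two sizes and see how the elapsed time grows. The helpers make_data and ids_by_subset, and the sizes used, are assumptions for illustration and are not taken from the question's real data.

```r
# Hypothetical helper: synthetic data with n_ids IDs and 5 rows per ID.
make_data <- function(n_ids) {
  data.frame(
    ID_1 = rep(seq_len(n_ids), each = 5),
    ID_2 = seq_len(n_ids * 5),
    Direction = rep(c(1, 1, 1, 0, 0), times = n_ids)
  )
}

# The lapply/subset step wrapped in a function so it can be timed.
ids_by_subset <- function(df, d) {
  lapply(unique(df$ID_1), function(x) subset(df, ID_1 == x & Direction == d)$ID_2)
}

# Elapsed seconds at two sizes. Every subset() call scans all rows,
# so the total work grows faster than linearly in the number of IDs.
sizes <- c(1000, 2000)
timings <- sapply(sizes, function(n) {
  df <- make_data(n)
  system.time(ids_by_subset(df, 1))[["elapsed"]]
})
names(timings) <- sizes
print(timings)
```

The absolute numbers depend on the machine, so only the ratio between the two sizes is informative; extrapolating it gives a rough idea of what to expect for millions of IDs.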
