Identify groups of linked episodes that chain together

Take this simple data frame of related identifiers:

    test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
    > test
      id1 id2
    1  10   1
    2  10  36
    3   1  24
    4   1  45
    5  24 300
    6   8  11

I now want to group together all the identifiers that are linked. By “linked”, I mean following the chain of links so that all identifiers in the same group are labelled together. Think of it as a branching structure, i.e.:

    Group 1
    10 --> 1,   1 --> (24,45)
                       24 --> 300
                              300 --> NULL
                       45 --> NULL
    10 --> 36, 36 --> NULL,
    Final group members: 10,1,24,36,45,300

    Group 2
    8 --> 11
          11 --> NULL
    Final group members: 8,11

I roughly know the logic I want, but I do not know how to implement it elegantly. I am thinking of using match or %in% recursively to go down each branch, but I am really stumped this time.
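The %in% idea described above can be sketched in base R as a loop that keeps growing one group until no new identifiers turn up. This is only an illustration of that logic, not a method taken from the answers; `collect_group` is a hypothetical helper name, and links are assumed to count in both directions.

```r
# Hypothetical helper: collect every id reachable from `start`
# by repeatedly following links in either direction.
collect_group <- function(df, start) {
  members <- start
  repeat {
    # rows that touch any id already in the group
    hit <- df$id1 %in% members | df$id2 %in% members
    grown <- unique(c(members, df$id1[hit], df$id2[hit]))
    if (length(grown) == length(members)) break  # nothing new was added
    members <- grown
  }
  members
}

test <- data.frame(id1 = c(10, 10, 1, 1, 24, 8),
                   id2 = c(1, 36, 24, 45, 300, 11))

collect_group(test, 10)  # ids linked to 10
collect_group(test, 8)   # ids linked to 8
```

Each pass rescans the whole data frame, so this is fine for small inputs but is likely to be slow on long edge lists.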

The end result I am after is:

    result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))
    > result
      group  id
    1     1  10
    2     1   1
    3     1  24
    4     1  36
    5     1  45
    6     1 300
    7     2   8
    8     2  11
3 answers

The Bioconductor RBGL package (an R interface to the Boost Graph Library) contains a connectedComp() function that identifies the connected components of a graph, which is just what you want.

(To use this function, you first need to install the graph and RBGL packages, both available from Bioconductor.)

    library(RBGL)
    test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

    ## Convert your 'from-to' data to a 'node and edge-list' representation
    ## used by the 'graph' & 'RBGL' packages
    g <- ftM2graphNEL(as.matrix(test))

    ## Extract the connected components
    cc <- connectedComp(g)

    ## Massage results into the format you're after
    ld <- lapply(seq_along(cc),
                 function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
    do.call(rbind, ld)
    #   group  id
    # 1     1  10
    # 2     1   1
    # 3     1  24
    # 4     1  36
    # 5     1  45
    # 6     1 300
    # 7     2   8
    # 8     2  11

Here is an alternative answer that I found after being pushed in the right direction by Josh. It uses the igraph package. For anyone searching for this kind of answer, my test dataset is what graph theory calls an “edge list” or “adjacency list” ( http://en.wikipedia.org/wiki/Graph_theory ).

    library(igraph)
    test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
    gr.test <- graph.data.frame(test)
    links <- data.frame(id=unique(unlist(test)),group=clusters(gr.test)$membership)
    links[order(links$group),]
    #    id group
    # 1  10     1
    # 2   1     1
    # 3  24     1
    # 5  36     1
    # 6  45     1
    # 7 300     1
    # 4   8     2
    # 8  11     2
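In igraph 1.0 and later, graph.data.frame() and clusters() are kept as older names for graph_from_data_frame() and components(). A sketch of the same approach with the newer names follows, assuming a reasonably recent igraph is installed. It reads the ids from the vertex names rather than relying on unique(unlist(test)) matching the vertex order.

```r
library(igraph)

test <- data.frame(id1 = c(10, 10, 1, 1, 24, 8),
                   id2 = c(1, 36, 24, 45, 300, 11))

# Treat each row as an undirected link between two ids
g <- graph_from_data_frame(test, directed = FALSE)

# components() labels each vertex with the connected component it belongs to
comp <- components(g)

# Vertex names are stored as character, so convert them back to numbers
links <- data.frame(id = as.numeric(V(g)$name),
                    group = unname(comp$membership))
links[order(links$group), ]
```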

Without using packages:

    # 2 sets of test data
    mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
    test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

    grouppairs <- function(df){

      # from wide to long format; assumes df is 2 columns of related id's
      test <- data.frame(group = 1:nrow(df),val = unlist(df))

      # keep moving to next pair until all same values have same group
      i <- 0
      while(any(duplicated(unique(test)$val))){
        i <- i+1

        # get group of matching values
        matches <- test[test$val == test$val[i],'group']

        # change all groups with matching values to same group
        test[test$group %in% matches,'group'] <- test$group[i]
      }

      # renumber starting from 1 and show only unique values in group order
      test$group <- match(test$group, sort(unique(test$group)))
      unique(test)[order(unique(test)$group), ]
    }

    # test
    grouppairs(test)
    grouppairs(mytest)
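Another package-free option is a small union-find (disjoint-set) pass over the pairs, which avoids re-scanning the whole table on every step. This is a different technique from the loop above, offered only as a sketch; `group_ids` is a hypothetical name, and the input is assumed to be a two-column data frame of linked ids.

```r
# Hypothetical helper: label connected ids with a simple union-find.
group_ids <- function(df) {
  ids <- unique(c(df[[1]], df[[2]]))
  parent <- seq_along(ids)              # every id starts as its own root

  # follow parent pointers until reaching a root
  find <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  a <- match(df[[1]], ids)
  b <- match(df[[2]], ids)
  for (k in seq_len(nrow(df))) {
    ra <- find(a[k])
    rb <- find(b[k])
    # merge the two trees, keeping the smaller index as the root
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  roots <- vapply(seq_along(ids), find, integer(1))
  out <- data.frame(group = match(roots, unique(roots)), id = ids)
  out[order(out$group), ]
}

test <- data.frame(id1 = c(10, 10, 1, 1, 24, 8),
                   id2 = c(1, 36, 24, 45, 300, 11))
group_ids(test)
```

Groups are numbered in the order their first id appears in the input, so on the question's data the ids 10, 1, 24, 36, 45 and 300 fall in group 1 and 8 and 11 in group 2.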
