Cyclization and Clustering

I must admit it is too hard for me to do it myself. I need to analyze some data, and this step is crucial for me.

The data I want to analyze is:

> dput(tbl_clustering) structure(list(P1 = structure(c(14L, 14L, 6L, 6L, 6L, 19L, 15L, 13L, 13L, 13L, 13L, 10L, 10L, 6L, 6L, 10L, 27L, 27L, 27L, 27L, 27L, 22L, 22L, 22L, 21L, 21L, 21L, 27L, 27L, 27L, 27L, 21L, 21L, 21L, 28L, 28L, 25L, 25L, 25L, 29L, 29L, 17L, 17L, 17L, 5L, 5L, 5L, 5L, 20L, 20L, 23L, 23L, 23L, 23L, 7L, 26L, 26L, 24L, 24L, 24L, 24L, 3L, 3L, 3L, 9L, 8L, 2L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 4L, 4L, 4L, 1L, 1L, 1L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("AT1G09130", "AT1G09620", "AT1G10760", "AT1G14610", "AT1G43170", "AT1G58080", "AT2G27680", "AT2G27710", "AT3G03710", "AT3G05590", "AT3G11510", "AT3G56130", "AT3G58730", "AT3G61540", "AT4G03520", "AT4G22930", "AT4G33030", "AT5G01600", "AT5G04710", "AT5G17990", "AT5G19220", "AT5G43940", "AT5G63310", "ATCG00020", "ATCG00380", "ATCG00720", "ATCG00770", "ATCG00810", "ATCG00900"), class = "factor"), P2 = structure(c(55L, 54L, 29L, 4L, 70L, 72L, 18L, 9L, 58L, 68L, 19L, 6L, 1L, 16L, 34L, 32L, 77L, 12L, 61L, 41L, 71L, 73L, 50L, 11L, 69L, 22L, 60L, 42L, 47L, 45L, 59L, 30L, 24L, 23L, 77L, 45L, 12L, 47L, 59L, 82L, 75L, 40L, 26L, 83L, 81L, 47L, 36L, 45L, 2L, 65L, 11L, 38L, 13L, 31L, 53L, 78L, 7L, 80L, 79L, 7L, 76L, 17L, 10L, 3L, 68L, 51L, 48L, 62L, 58L, 64L, 68L, 74L, 63L, 14L, 57L, 33L, 56L, 39L, 52L, 35L, 43L, 25L, 27L, 21L, 15L, 5L, 49L, 37L, 66L, 20L, 44L, 69L, 22L, 67L, 57L, 8L, 46L, 28L), .Label = c("AT1G01090", "AT1G02150", "AT1G03870", "AT1G09795", "AT1G13060", "AT1G14320", "AT1G15820", "AT1G17745", "AT1G20630", "AT1G29880", "AT1G29990", "AT1G43170", "AT1G52340", "AT1G52670", "AT1G56450", "AT1G59900", "AT1G69830", "AT1G75330", "AT1G78570", "AT2G05840", "AT2G28000", "AT2G34590", "AT2G35040", "AT2G37020", "AT2G40300", "AT2G42910", "AT2G44050", "AT2G44350", "AT2G45440", "AT3G01500", "AT3G03980", "AT3G04840", "AT3G07770", "AT3G13235", "AT3G14415", "AT3G18740", "AT3G22110", "AT3G22480", "AT3G22960", "AT3G51840", "AT3G54210", "AT3G54400", "AT3G56090", "AT3G60820", "AT4G00100", "AT4G00570", "AT4G02770", "AT4G11010", "AT4G14800", "AT4G18480", "AT4G20760", "AT4G26530", "AT4G28750", "AT4G30910", "AT4G30920", "AT4G33760", "AT4G34200", "AT5G02500", "AT5G02960", "AT5G10920", "AT5G12250", "AT5G13120", "AT5G16390", "AT5G18380", "AT5G35360", "AT5G35590", "AT5G35630", "AT5G35790", "AT5G48300", "AT5G52100", "AT5G56030", "AT5G60160", "AT5G64300", "AT5G67360", "ATCG00160", "ATCG00270", "ATCG00380", "ATCG00540", "ATCG00580", "ATCG00680", "ATCG00750", "ATCG00820", "ATCG01110"), class = "factor"), No_Interactions = c(8L, 5L, 5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 5L, 8L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 10L, 6L, 6L, 5L, 5L, 5L, 5L, 8L, 5L, 5L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 8L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 5L, 5L, 6L)), .Names = c("P1", "P2", "No_Interactions"), class = "data.frame", row.names = c(NA, -98L)) 

To better explain what I want to achieve, I will put a few lines here:

  P1 P2 No_Interactions 1 AT3G61540 AT4G30920 8 2 AT3G61540 AT4G30910 5 3 AT1G58080 AT2G45440 5 4 AT1G58080 AT1G09795 9 5 AT1G58080 AT5G52100 7 6 AT5G04710 AT5G60160 6 7 AT4G03520 AT1G75330 5 8 AT3G58730 AT1G20630 5 9 AT3G58730 AT5G02500 5 10 AT3G58730 AT5G35790 5 

First of all, you need to create a new Cluster column. Then we focus only on the two columns P1 and P2 . As you can see in the first line, we have two names AT3G61540 and AT4G30920 and that is our starting point (a cycle, in my opinion, will be necessary). We put the number 1 in the Cluster column. Then we take the name AT3G61540 and look at both columns P1 and P2 , if we again find this name somewhere with a different name than in the first row, we put number 1 also in Cluster . Then we take the middle name from the first line of the AT4G30920 and do the same screening through the whole data.

The next step is to analyze the next line and do exactly the same thing. In this case, in the next line, we have exactly the same name for P1 , which means that we do not need to display it, but the second name AT4G30910 is different, so it would be great to display it. The problem that appears here is that this line should also be cluster 1 . cluster 2 starts on the third line, because we have a whole new pair of names.

I know that this is not an easy task and probably it needs to be done in a couple of steps.

EDIT: The result I would like to get:

  P1 P2 No_Interactions Cluster 1 AT3G61540 AT4G30920 8 1 2 AT3G61540 AT4G30910 5 1 3 AT1G58080 AT2G45440 5 2 4 AT1G58080 AT1G09795 9 2 5 AT1G58080 AT5G52100 7 2 6 AT5G04710 AT5G60160 6 3 7 AT5G52100 AT1G75330 5 2 ### Cluster 2 because AT5G52100 was found in the row number 5 as a partner of AT1G58080 8 AT3G58730 AT1G20630 5 5 9 AT3G58730 AT5G02500 5 5 10 AT3G58730 AT3G61540 5 1 ## Cluster 1 because AT3G61540 was found in first row. 
+5
source share
5 answers

I corrected my original answer and offered you a functional programming approach using Map and recursion to find your clusters:

 library(magrittr) similar = function(u,v) if(length(intersect(u,v))==0) FALSE else TRUE clusterify = function(df) { clusters = df$cluster if(!any(clusters==0)) return(df) idx = pmatch(0, clusters) lst = Map(c, as.character(df[,1]), as.character(df[,2])) el = c(as.character(df[idx, 1]), as.character(df[idx, 2])) K = lst %>% sapply(similar, v=el) %>% add(0) mask = if(any(clusters!=0 & K==1)) if(any(mask)) { cl = min(clusters[mask]) df[K==1,]$cluster = cl } else { df[K==1,]$cluster = max(clusters) + 1 } clusterify(df) } 

You can use it clusterify(transform(df, cluster=0))

For example, clustering works correctly in your example, taking cluster 9 (you can check other clusters):

 subset(clusterify(transform(df, cluster=0)), cluster==9) # P1 P2 No_Interactions cluster #25 AT5G19220 AT5G48300 10 9 #26 AT5G19220 AT2G34590 6 9 #27 AT5G19220 AT5G10920 6 9 #32 AT5G19220 AT3G01500 8 9 #33 AT5G19220 AT2G37020 5 9 #34 AT5G19220 AT2G35040 5 9 #92 AT4G22930 AT5G48300 5 9 #93 AT4G22930 AT2G34590 5 9 #94 AT4G22930 AT5G35630 5 9 #95 AT4G22930 AT4G34200 7 9 #96 AT4G22930 AT1G17745 5 9 #97 AT4G22930 AT4G00570 5 9 #98 AT4G22930 AT2G44350 6 9 
+5
source

You can use the igraph library to create an undirected graph in which you combine connected elements:

 library('igraph') # make graph and cluster g = graph.data.frame(tbl_clustering[,c('P1', 'P2')], directed=FALSE) c = clusters(g) # append cluster number to original data tbl_clustering$cluster = sapply(as.vector(tbl_clustering$P1), function(x)c$membership[x]) 

This assigns clusters to records (here are the first lines):

 > head(tbl_clustering, 8) P1 P2 No_Interactions cluster 1 AT3G61540 AT4G30920 8 1 2 AT3G61540 AT4G30910 5 1 3 AT1G58080 AT2G45440 5 2 4 AT1G58080 AT1G09795 9 2 5 AT1G58080 AT5G52100 7 2 6 AT5G04710 AT5G60160 6 3 7 AT4G03520 AT1G75330 5 4 8 AT3G58730 AT1G20630 5 5 
+3
source

I believe that you want to split your data set into equivalence classes. I have an implementation based on an algorithm in Numericical Recipes . I have included the code below. It can be used as follows:

 source("equivalence.R") ids <- unique(c(levels(data[[1]]), levels(data[[2]]))) classes <- equivalence(ids, data[1:2]) data$class <- classes[match(data$P1, ids)] 

equivalence.R

 library(Rcpp) Rcpp::sourceCpp('equivalence.cpp') equivalence <- function(x, rules) { tmp <- unique(x) tmp <- tmp[!is.na(tmp)] a <- match(rules[[1]], tmp) b <- match(rules[[2]], tmp) sel <- !is.na(a) & !is.na(b) if (any(!sel)) { warning("Not all values in rules are present in x.") a <- a[sel] b <- b[sel] } res <- c_equivalence(as.integer(a)-1L, as.integer(b)-1L, as.integer(length(tmp))) res[match(x, tmp)] + 1L } 

equivalence.cpp

 #include <Rh> #include <Rinternals.h> #include <string> extern "C" { // [[Rcpp::export]] SEXP c_equivalence(SEXP ra, SEXP rb, SEXP rn) { try { if (LENGTH(ra) != LENGTH(rb)) throw std::string("Lengths of a and be do not match."); int* a = INTEGER(ra); int* b = INTEGER(rb); int m = LENGTH(ra); int n = INTEGER(rn)[0]; SEXP classes = PROTECT(allocVector(INTSXP, n)); int* cls = INTEGER(classes); //Initialize each element its own class. for (int k = 0; k < n; k++) cls[k] = k; //For each piece of input information... for (int l = 0; l < m; l++) { //Track first element up to its ancestor. int j = a[l]; while (cls[j] != j) j = cls[j]; //Track second element up to its ancestor. int k = b[l]; while (cls[k] != k) k = cls[k]; //If they are not already related, make them so. if (j != k) { cls[j] = k; } } //Final sweep up to highest ancestors. for (int j = 0; j < n; j++) { while (cls[j] != cls[cls[j]]) cls[j] = cls[cls[j]]; } UNPROTECT(1); return classes; } catch(const std::string& e) { error(e.c_str()); return R_NilValue; } catch (...) { error("Uncaught exception."); return R_NilValue; } } } 
+1
source

Ok, here is a new answer that goes aside. Again, dat is a data frame.

 Cluster <- rep(NA, length(dat[, 1])) #initialise for(r in 1:length(Cluster)){ if(identical(as.numeric(r), 1)){Cmatches <- matrix(c(as.character(dat[1, 1]), as.character(dat[1, 2])), 2, 1)} matched <- F for(cl in 1:length(Cmatches[1,])){ if(sum(c(as.character(dat[r, 1]), as.character(dat[r, 2])) %in% Cmatches[, cl]) != 0){ #add P1 and P2 from this row to the cluster which it matches Cmatches <- rbind(Cmatches, matrix(c(if(cl != 1){rep("", (cl - 1)*2)}else{character(0)}, as.character(dat[r, 1]), as.character(dat[r, 2]), if(cl != length(Cmatches[1,])){rep("", (length(Cmatches[1, ]) - cl)*2)}else{character(0)}), 2, length(Cmatches[1,]), byrow = F)) matched <- T Cluster[r] <- cl } } if(!matched){ #add a new cluster, because doesn't match any existing Cmatches <- cbind(Cmatches, c(c(as.character(dat[r, 1]), as.character(dat[r, 2])), rep("", length(Cmatches[, 1]) - 2))) Cluster[r] <- length(Cmatches[1,]) } } 

After that, you should take the Cmatch matrix and then check the matches between the clusters using if(sum(match(Cmatch[, cl1], Cmatch[, cl2], incomparables = ""), na.rm = T) != 0) (where cl1 and cl2 are the cluster numbers for comparison). If this test was right, then these clusters should be grouped.

 Cmatched <- rep(NA, length(Cmatches[1,])) for(cl in 1:(length(Cmatches[1,]) - 1)){ for(cl2 in (cl + 1):length(Cmatches[1,])){ if(sum(match(Cmatches[, cl], Cmatches[, cl2], incomparables = ""), na.rm = T) != 0){ if(is.na(Cmatched[cl])){ Cluster[Cluster == cl2] <- cl Cmatched[cl2] <- cl }else{ Cluster[Cluster == cl2] <- Cmatched[cl] Cmatched[cl2] <- cl } } } } 

And I think there is your answer. Then just dat <- cbind(dat, Cluster) .

+1
source

It looks like you want to classify clustering. You should study the k-modes cluster, which is an extension of k-means . The k-mode algorithm reflects the steps of k-means. Here is the diagram presented in the document.

  • Randomly select k unique objects as the initial centers (modes) of the cluster.
  • Calculate the distance between each object and the cluster mode; Assign an object to a cluster whose center has the shortest distance to the object. repeat this step until all objects are assigned to clusters.
  • Select a new mode for each cluster and compare it with the previous mode. If different, return to step 2; otherwise stop

There are other problems discussed as k-prototypes (for mixing categorical with numerical data), fuzzy k-modes (for assigning cluster membership), and initialization of k-modes.

0
source

All Articles