I must admit this is too hard for me to do on my own. I need to analyze some data, and this step is crucial.
The data I want to analyze is:
> dput(tbl_clustering) structure(list(P1 = structure(c(14L, 14L, 6L, 6L, 6L, 19L, 15L, 13L, 13L, 13L, 13L, 10L, 10L, 6L, 6L, 10L, 27L, 27L, 27L, 27L, 27L, 22L, 22L, 22L, 21L, 21L, 21L, 27L, 27L, 27L, 27L, 21L, 21L, 21L, 28L, 28L, 25L, 25L, 25L, 29L, 29L, 17L, 17L, 17L, 5L, 5L, 5L, 5L, 20L, 20L, 23L, 23L, 23L, 23L, 7L, 26L, 26L, 24L, 24L, 24L, 24L, 3L, 3L, 3L, 9L, 8L, 2L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 4L, 4L, 4L, 1L, 1L, 1L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("AT1G09130", "AT1G09620", "AT1G10760", "AT1G14610", "AT1G43170", "AT1G58080", "AT2G27680", "AT2G27710", "AT3G03710", "AT3G05590", "AT3G11510", "AT3G56130", "AT3G58730", "AT3G61540", "AT4G03520", "AT4G22930", "AT4G33030", "AT5G01600", "AT5G04710", "AT5G17990", "AT5G19220", "AT5G43940", "AT5G63310", "ATCG00020", "ATCG00380", "ATCG00720", "ATCG00770", "ATCG00810", "ATCG00900"), class = "factor"), P2 = structure(c(55L, 54L, 29L, 4L, 70L, 72L, 18L, 9L, 58L, 68L, 19L, 6L, 1L, 16L, 34L, 32L, 77L, 12L, 61L, 41L, 71L, 73L, 50L, 11L, 69L, 22L, 60L, 42L, 47L, 45L, 59L, 30L, 24L, 23L, 77L, 45L, 12L, 47L, 59L, 82L, 75L, 40L, 26L, 83L, 81L, 47L, 36L, 45L, 2L, 65L, 11L, 38L, 13L, 31L, 53L, 78L, 7L, 80L, 79L, 7L, 76L, 17L, 10L, 3L, 68L, 51L, 48L, 62L, 58L, 64L, 68L, 74L, 63L, 14L, 57L, 33L, 56L, 39L, 52L, 35L, 43L, 25L, 27L, 21L, 15L, 5L, 49L, 37L, 66L, 20L, 44L, 69L, 22L, 67L, 57L, 8L, 46L, 28L), .Label = c("AT1G01090", "AT1G02150", "AT1G03870", "AT1G09795", "AT1G13060", "AT1G14320", "AT1G15820", "AT1G17745", "AT1G20630", "AT1G29880", "AT1G29990", "AT1G43170", "AT1G52340", "AT1G52670", "AT1G56450", "AT1G59900", "AT1G69830", "AT1G75330", "AT1G78570", "AT2G05840", "AT2G28000", "AT2G34590", "AT2G35040", "AT2G37020", "AT2G40300", "AT2G42910", "AT2G44050", "AT2G44350", "AT2G45440", "AT3G01500", "AT3G03980", "AT3G04840", "AT3G07770", "AT3G13235", "AT3G14415", "AT3G18740", "AT3G22110", "AT3G22480", "AT3G22960", "AT3G51840", "AT3G54210", "AT3G54400", "AT3G56090", 
"AT3G60820", "AT4G00100", "AT4G00570", "AT4G02770", "AT4G11010", "AT4G14800", "AT4G18480", "AT4G20760", "AT4G26530", "AT4G28750", "AT4G30910", "AT4G30920", "AT4G33760", "AT4G34200", "AT5G02500", "AT5G02960", "AT5G10920", "AT5G12250", "AT5G13120", "AT5G16390", "AT5G18380", "AT5G35360", "AT5G35590", "AT5G35630", "AT5G35790", "AT5G48300", "AT5G52100", "AT5G56030", "AT5G60160", "AT5G64300", "AT5G67360", "ATCG00160", "ATCG00270", "ATCG00380", "ATCG00540", "ATCG00580", "ATCG00680", "ATCG00750", "ATCG00820", "ATCG01110"), class = "factor"), No_Interactions = c(8L, 5L, 5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 5L, 8L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 10L, 6L, 6L, 5L, 5L, 5L, 5L, 8L, 5L, 5L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 8L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 5L, 5L, 6L)), .Names = c("P1", "P2", "No_Interactions"), class = "data.frame", row.names = c(NA, -98L))
To explain what I want to achieve, here are the first few rows:
              P1        P2 No_Interactions
    1  AT3G61540 AT4G30920               8
    2  AT3G61540 AT4G30910               5
    3  AT1G58080 AT2G45440               5
    4  AT1G58080 AT1G09795               9
    5  AT1G58080 AT5G52100               7
    6  AT5G04710 AT5G60160               6
    7  AT4G03520 AT1G75330               5
    8  AT3G58730 AT1G20630               5
    9  AT3G58730 AT5G02500               5
    10 AT3G58730 AT5G35790               5
First, a new Cluster column needs to be created. After that, only the two columns P1 and P2 matter. The first row contains two names, AT3G61540 and AT4G30920, and that is the starting point (I think a loop will be necessary). This row gets the number 1 in the Cluster column. Then we take the name AT3G61540 and search both columns, P1 and P2; every other row in which this name appears, paired with a different name than in the first row, also gets 1 in Cluster. Then we take the second name from the first row, AT4G30920, and screen the whole data set in the same way.
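A minimal sketch of the screening step described above, written in base R. It assumes the names are compared as character strings (the real P1 and P2 are factors with different level sets, so as.character() would be a safe first step), and it uses a hand-typed copy of the first five rows rather than the full tbl_clustering. The names tbl, seed and hits are made up for this illustration.

```r
# First five rows of the example, typed in by hand from the printed table.
tbl <- data.frame(
  P1 = c("AT3G61540", "AT3G61540", "AT1G58080", "AT1G58080", "AT1G58080"),
  P2 = c("AT4G30920", "AT4G30910", "AT2G45440", "AT1G09795", "AT5G52100"),
  No_Interactions = c(8L, 5L, 5L, 9L, 7L),
  stringsAsFactors = FALSE
)

# The two names of the starting row (row 1).
seed <- c(tbl$P1[1], tbl$P2[1])

# Screen both columns: keep every row that mentions either seed name.
hits <- which(tbl$P1 %in% seed | tbl$P2 %in% seed)
hits
```

Here hits contains rows 1 and 2, because AT3G61540 appears in P1 of both rows. This covers only a single pass; names picked up in the matched rows (AT4G30910 in this case) would have to be screened in turn.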
The next step is to take the next row and do exactly the same thing. Here the second row has the same name in P1 as the first row, so it does not need to be displayed again, but the second name, AT4G30910, is different, so it should be displayed. The catch is that this row must also belong to cluster 1. Cluster 2 starts at the third row, because it contains a completely new pair of names.
I know this is not an easy task and that it probably needs to be done in several steps.
EDIT: The result I would like to get:
             P1        P2 No_Interactions Cluster
    1 AT3G61540 AT4G30920               8       1
    2 AT3G61540 AT4G30910               5       1
    3 AT1G58080 AT2G45440               5       2
    4 AT1G58080 AT1G09795               9       2
    5 AT1G58080 AT5G52100               7       2
    6 AT5G04710 AT5G60160               6       3
    7 AT5G52100 AT1G75330               5       2
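
One possible reading of the procedure is that rows belong to the same cluster whenever they are linked, directly or through other rows, by a shared name in P1 or P2. The sketch below follows that reading in base R; it is an assumption about the intended rule, not a confirmed solution. The function name assign_clusters is hypothetical, and the data frame is a hand-typed copy of the seven rows of the desired result shown above.

```r
# Hypothetical helper: number clusters in order of first appearance.
# Two rows share a cluster if a chain of shared names connects them.
assign_clusters <- function(p1, p2) {
  p1 <- as.character(p1)  # factors with different levels compare safely as text
  p2 <- as.character(p2)
  cluster <- rep(NA_integer_, length(p1))
  next_id <- 0L
  for (i in seq_along(p1)) {
    if (!is.na(cluster[i])) next        # row already placed in a cluster
    next_id <- next_id + 1L
    known <- c(p1[i], p2[i])            # names seen so far in this cluster
    repeat {
      # Unlabelled rows that mention any known name in either column.
      member <- is.na(cluster) & (p1 %in% known | p2 %in% known)
      if (!any(member)) break           # nothing new: cluster is complete
      cluster[member] <- next_id
      known <- union(known, c(p1[member], p2[member]))
    }
  }
  cluster
}

# The seven rows of the desired result, typed in by hand.
tbl <- data.frame(
  P1 = c("AT3G61540", "AT3G61540", "AT1G58080", "AT1G58080",
         "AT1G58080", "AT5G04710", "AT5G52100"),
  P2 = c("AT4G30920", "AT4G30910", "AT2G45440", "AT1G09795",
         "AT5G52100", "AT5G60160", "AT1G75330"),
  No_Interactions = c(8L, 5L, 5L, 9L, 7L, 6L, 5L),
  stringsAsFactors = FALSE
)

tbl$Cluster <- assign_clusters(tbl$P1, tbl$P2)
tbl
```

On these seven rows the Cluster column comes out as 1, 1, 2, 2, 2, 3, 2, matching the table above: row 7 joins cluster 2 because AT5G52100 already appears in P2 of row 5. On the full tbl_clustering the same call would be assign_clusters(tbl_clustering$P1, tbl_clustering$P2); there, names shared across distant rows may merge groups that look separate when read row by row.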