Counting unique elements when some of them are synonymous with each other

I am trying to count the number of unique drugs on this list.

my_drugs=c('a', 'b', 'd', 'h', 'q') 

I have the following dictionary that gives me synonyms for drugs, but it is not set up so that each entry defines a distinct drug (the same drug can show up under several entries):

    dictionary <- read.table(header=TRUE, text="
    drug names
    a    b;c;d;x
    x    b;c;q
    r    h;g;f
    l    m;n
    ")

Thus, in this case there are 2 unique drugs on the list (because a, directly or indirectly, has synonyms b, d, q). Synonyms of synonyms are considered synonyms.

My attempt was to first build a dictionary that has only unique drugs on the left side. To do this, I would cycle through dictionary$drug, grep for each drug in dictionary$drug and dictionary$names, take the union of the matches, use it to replace dictionary$names, and then delete the other lines from the dictionary.

    bigdf=dictionary
    small_df=data.frame("drug"=NA,"names"=NA)
    for(i in 1:nrow(bigdf)){
      search_term=sprintf("*%s*",bigdf$drug[i])
      index=grep(search_term,bigdf$names)
      list=bigdf$names[index]
      list=Reduce(union,list)
      list=paste(list, collapse=";")
      if(!list==""){
        new_row=data.frame("drug"=bigdf$drug[index][1],"names"=list)
        small_df=rbind(small_df,new_row)
        #small_df
        bigdf=bigdf[-index,]
        #dim(bigdf)
      } else{
        new_row=data.frame("drug"=bigdf$drug[index][1],"names"="alreadycounted")
        small_df=rbind(small_df,new_row)
      }
    }

It didn't work (some drugs were missing from small_df), and even if it had, I am not sure how I would use the new dictionary to count the number of unique drugs on my list.

How can I count the number of unique drugs in my_drugs?

Thanks for the help, and let me know if this requires further clarification.

Dataset size: 200 items in my_drugs, 2000 lines in the dictionary, each drug has 10-12 synonyms.

r unique overlap synonym
1 answer
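    library(igraph)

    # Build one edge for every pair of names that share a dictionary row
    df1 = unique(data.frame(do.call(
        rbind,
        apply(X = dictionary, MARGIN = 1,
              FUN = function(x) t(combn(unlist(strsplit(x, ";")), 2, sort))))))
    g = graph.data.frame(df1)

    # Keep only the vertices that appear in my_drugs
    g2 = delete.vertices(g, unique(V(g)$name)[!unique(V(g)$name) %in% my_drugs])

    # Count the connected components that remain
    clusters(g2)$no
    #[1] 2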
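
One caveat with the igraph approach above: it deletes every vertex that is not in my_drugs before counting clusters. If two listed drugs are linked only through a drug that is not on the list, that deletion would split them into separate clusters. The following is a minimal base-R sketch of an alternative, not the answerer's method: it groups names over the whole dictionary with a simple union-find and only then looks up the listed drugs. The helper name `count_unique_drugs` is hypothetical, and the sketch assumes that a drug missing from the dictionary should count as its own group.

```r
# Example data from the question.
my_drugs <- c("a", "b", "d", "h", "q")
dictionary <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
drug names
a    b;c;d;x
x    b;c;q
r    h;g;f
l    m;n
")

# count_unique_drugs() is a hypothetical helper, not part of any package.
count_unique_drugs <- function(drugs, dictionary) {
  # One character vector per dictionary row: the drug followed by its synonyms.
  groups <- Map(function(d, s) c(d, strsplit(s, ";", fixed = TRUE)[[1]]),
                as.character(dictionary$drug), as.character(dictionary$names))

  # Every name starts as its own root. Drugs absent from the dictionary are
  # included too, so they end up as singleton groups (an assumption).
  all_names <- unique(c(unlist(groups), drugs))
  parent <- setNames(all_names, all_names)

  # Follow parent links until reaching a name that is its own parent.
  find_root <- function(x) {
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }

  # Merge every name in a row into the group of the row's first name.
  for (g in groups) {
    root <- find_root(g[1])
    for (name in g[-1]) {
      r <- find_root(name)
      if (r != root) parent[[r]] <- root
    }
  }

  # The number of distinct roots among the listed drugs is the answer.
  length(unique(vapply(drugs, find_root, character(1))))
}

count_unique_drugs(my_drugs, dictionary)
#[1] 2
```

Because the grouping is done before the lookup, a chain such as a -> x -> q still puts a and q in one group even when x is not in my_drugs. The sketch skips path compression to stay short, which should be acceptable for a dictionary of a few thousand rows.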