Let me start with an example. You have the string "Cell Biology". The first substitution, A <- gsub("Biology", "Science", A), turns it into "Cell Science", which is then no longer replaced by any later substitution.
Since your patterns are fixed strings rather than real regular expressions, I would use a lookup table (a kind of hash) for the replacements:
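A minimal sketch of the problem, assuming just two substitutions applied in this order (the two-element vector and the second gsub call are illustrative, not taken from your code):

```r
# Sequential substitutions: the first call rewrites part of the longer label,
# so a later pattern written for the full label no longer finds it.
A <- c("Biology", "Cell Biology")
A <- gsub("Biology", "Science", A)       # also rewrites the tail of "Cell Biology"
A <- gsub("Cell Biology", "Science", A)  # nothing left to match
print(A)
# [1] "Science"      "Cell Science"
```

Reordering the calls so that longer labels come first would hide the problem for this pair, but it becomes hard to maintain as the list of labels grows.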
myhash <- c(
  "Science", "Science", "Arts & Humanities", "Arts & Humanities",
  "Social Sciences", "Science", "Science", "Social Sciences",
  "Science", "Science", "Science", "Social Sciences",
  "Science", "Social Sciences", "Social Sciences", "Science",
  "Science", "Science", "Science", "Science" )
names( myhash ) <- c(
  "Biology", "Cell Biology", "Art", "Humanities, Multidisciplinary",
  "Psychology, Experimental", "Astronomy & Astrophysics",
  "Physics, Particles & Fields", "Economics",
  "Mathematics", "Mathematics, Applied",
  "Mathematics, Interdisciplinary Applications",
  "Social Sciences, Mathematical Methods",
  "Geriatrics & Gerontology", "Gerontology", "Management",
  "Operations Research & Management Science",
  "Computer Science, Artificial Intelligence",
  "Computer Science, Information Systems",
  "Engineering, Electrical & Electronic", "Statistics & Probability" )
Now, given a string such as "Biology", you can quickly look up its category:
myhash[ "Biology" ]
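The lookup is vectorised, so several keys can be translated in one call, and a key that is not in the table comes back as NA. A small sketch, assuming a three-entry subset of the table (repeated here so the block runs on its own) and a made-up missing key "Zoology":

```r
# Subset of the lookup table, for illustration only.
myhash <- c("Biology"   = "Science",
            "Art"       = "Arts & Humanities",
            "Economics" = "Social Sciences")

keys <- c("Economics", "Biology", "Zoology")  # "Zoology" is a hypothetical unknown label
res  <- unname(myhash[keys])                  # look up all keys at once, drop the names
print(res)
# [1] "Social Sciences" "Science"         NA

# Labels that have no category yet are easy to list:
print(keys[is.na(res)])
# [1] "Zoology"
```

Checking for NA after the lookup is a cheap way to spot labels that still need an entry in the table.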
I'm not sure why you want to use a list instead of a character vector, so I will simplify your case a bit:
A <- c( "Biology", "Cell Biology", "Art",
  "Humanities, Multidisciplinary; Psychology, Experimental",
  "Astronomy & Astrophysics; Physics, Particles & Fields",
  "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
  "Geriatrics & Gerontology", "Gerontology", "Management",
  "Operations Research & Management Science",
  "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
  "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability" )
A direct lookup will not work for the compound strings (those containing ";"). You can, however, split them using strsplit. Then you can use unique to avoid repeating a category, and join the pieces back together using the paste function.
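stringedit <- function( x ) {
  # split a compound string on ";" (plus any following spaces)
  terms <- unlist( strsplit( x, "; *" ) )
  # look up each term, drop repeated categories, and join them again
  paste( unique( myhash[ terms ] ), collapse = "; " )
}
unlist( lapply( A, stringedit ) )
Here is the result: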
[1] "Science" "Science" "Arts & Humanities" "Arts & Humanities; Social Sciences"
[5] "Science" "Social Sciences; Science" "Science" "Social Sciences"
[9] "Social Sciences" "Science" "Science" "Social Sciences; Science"
Of course, you can also chain several *apply calls, like this:
# split every element, map each term to its category (without repeats), then re-join
a.spl <- sapply( A, strsplit, "; *" )
a.spl <- lapply( a.spl, function( x ) unique( myhash[ x ] ) )
unname( sapply( a.spl, paste, collapse = "; " ) )
It is neither more nor less efficient than the previous code.
Yes, you could achieve the same with regular expressions, but first, you would have to split the strings anyway, and then use anchored patterns like ^Biology$ to make sure they match "Biology" but not "Cell Biology", etc., unless you want to get into constructions like ".*Biology". Finally, you would still have to get rid of the duplicates. All of this would be, in my opinion, (i) less explicit (= more error prone) and (ii) not worth the effort.
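To illustrate the point about anchoring, here is a small sketch with a made-up three-element vector (not your data). The unanchored pattern damages "Cell Biology", while the anchored one leaves the compound string untouched, which is why the splitting step would still be needed:

```r
A <- c("Biology", "Cell Biology", "Economics; Biology")

# Unanchored: matches "Biology" anywhere, including inside longer labels.
unanchored <- gsub("Biology", "Science", A)
print(unanchored)
# [1] "Science"            "Cell Science"       "Economics; Science"

# Anchored: only an element that is exactly "Biology" is replaced,
# so the term inside the compound string is missed.
anchored <- gsub("^Biology$", "Science", A)
print(anchored)
# [1] "Science"            "Cell Biology"       "Economics; Biology"
```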