Proper use of gsub / regular expressions in R?

I have long lists of strings, such as this reproducible example:

    A <- list(c(
      "Biology",
      "Cell Biology",
      "Art",
      "Humanities, Multidisciplinary; Psychology, Experimental",
      "Astronomy & Astrophysics; Physics, Particles & Fields",
      "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
      "Geriatrics & Gerontology",
      "Gerontology",
      "Management",
      "Operations Research & Management Science",
      "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
      "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"
    ))

So it looks like this:

    > A
    [[1]]
     [1] "Biology"
     [2] "Cell Biology"
     [3] "Art"
     [4] "Humanities, Multidisciplinary; Psychology, Experimental"
     [5] "Astronomy & Astrophysics; Physics, Particles & Fields"
     [6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"
     [7] "Geriatrics & Gerontology"
     [8] "Gerontology"
     [9] "Management"
    [10] "Operations Research & Management Science"
    [11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"
    [12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"

I would like to edit these terms and eliminate duplicates in order to get this result:

     [1] "Science"
     [2] "Science"
     [3] "Arts & Humanities"
     [4] "Arts & Humanities; Social Sciences"
     [5] "Science"
     [6] "Social Sciences; Science"
     [7] "Science"
     [8] "Social Sciences"
     [9] "Social Sciences"
    [10] "Science"
    [11] "Science"
    [12] "Social Sciences; Science"

So far, I only have this:

    stringedit <- function(A) {
      A <- gsub("Biology", "Science", A)
      A <- gsub("Cell Biology", "Science", A)
      A <- gsub("Art", "Arts & Humanities", A)
      A <- gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)
      A <- gsub("Psychology, Experimental", "Social Sciences", A)
      A <- gsub("Astronomy & Astrophysics", "Science", A)
      A <- gsub("Physics, Particles & Fields", "Science", A)
      A <- gsub("Economics", "Social Sciences", A)
      A <- gsub("Mathematics", "Science", A)
      A <- gsub("Mathematics, Applied", "Science", A)
      A <- gsub("Mathematics, Interdisciplinary Applications", "Science", A)
      A <- gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)
      A <- gsub("Geriatrics & Gerontology", "Science", A)
      A <- gsub("Gerontology", "Social Sciences", A)
      A <- gsub("Management", "Social Sciences", A)
      A <- gsub("Operations Research & Management Science", "Science", A)
      A <- gsub("Computer Science, Artificial Intelligence", "Science", A)
      A <- gsub("Computer Science, Information Systems", "Science", A)
      A <- gsub("Engineering, Electrical & Electronic", "Science", A)
      A <- gsub("Statistics & Probability", "Science", A)
    }
    B <- lapply(A, stringedit)

But it does not work correctly:

    > B
    [[1]]
     [1] "Science"
     [2] "Cell Science"
     [3] "Arts & Humanities"
     [4] "Arts & Humanities; Social Sciences"
     [5] "Science; Science"
     [6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"
     [7] "Science"
     [8] "Social Sciences"
     [9] "Social Sciences"
    [10] "Operations Research & Social Sciences Science"
    [11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"
    [12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"

How can I get the correct result shown above?
Thank you so much for your attention!

+6
3 answers

Let me start with one example. You have the string "Cell Biology". The first substitution, A <- gsub("Biology", "Science", A), turns it into "Cell Science". The later substitution for "Cell Biology" then no longer finds anything to match, so the string is left as "Cell Science".
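The ordering problem can be reproduced in isolation. This is a minimal sketch using only three of the substitutions from the question; the names x, step1, step2 and art are made up for illustration.

```r
x <- c("Cell Biology", "Computer Science, Artificial Intelligence")

# gsub() replaces every occurrence of the pattern, even inside a longer term
step1 <- gsub("Biology", "Science", x)
step1[1]
# "Cell Science"

# the later, more specific pattern no longer finds anything to replace
step2 <- gsub("Cell Biology", "Science", step1)
identical(step1, step2)
# TRUE

# "Art" also matches the first three letters of "Artificial"
art <- gsub("Art", "Arts & Humanities", x[2])
art
# "Computer Science, Arts & Humanitiesificial Intelligence"
```

Both effects are visible in items [2] and [11] of the output B in the question.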

Since you are not really using regular expressions, I would rather use a named vector as a lookup table (a kind of hash) for the replacements:

    myhash <- c(
      "Biology"                                     = "Science",
      "Cell Biology"                                = "Science",
      "Art"                                         = "Arts & Humanities",
      "Humanities, Multidisciplinary"               = "Arts & Humanities",
      "Psychology, Experimental"                    = "Social Sciences",
      "Astronomy & Astrophysics"                    = "Science",
      "Physics, Particles & Fields"                 = "Science",
      "Economics"                                   = "Social Sciences",
      "Mathematics"                                 = "Science",
      "Mathematics, Applied"                        = "Science",
      "Mathematics, Interdisciplinary Applications" = "Science",
      "Social Sciences, Mathematical Methods"       = "Social Sciences",
      "Geriatrics & Gerontology"                    = "Science",
      "Gerontology"                                 = "Social Sciences",
      "Management"                                  = "Social Sciences",
      "Operations Research & Management Science"    = "Science",
      "Computer Science, Artificial Intelligence"   = "Science",
      "Computer Science, Information Systems"       = "Science",
      "Engineering, Electrical & Electronic"        = "Science",
      "Statistics & Probability"                    = "Science"
    )

Now, given a string such as "Biology", you can quickly look up its category:

    myhash["Biology"]
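Two properties of this kind of lookup are worth seeing in action. The sketch below uses a cut-down, hypothetical three-entry table (small.hash) instead of the full myhash: indexing accepts several keys at once, and a key that is not in the table yields NA rather than an error.

```r
# hypothetical, shortened lookup table for illustration only
small.hash <- c("Biology"   = "Science",
                "Art"       = "Arts & Humanities",
                "Economics" = "Social Sciences")

# several keys can be looked up in one call; unname() drops the key labels
found <- unname(small.hash[c("Art", "Biology")])
found
# "Arts & Humanities" "Science"

# an unknown key gives NA, so terms missing from the table are easy to detect
unknown <- unname(small.hash["Botany"])
is.na(unknown)
# TRUE
```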

I'm not sure why you want a list rather than a plain character vector, so I will simplify your case a bit:

    A <- c(
      "Biology",
      "Cell Biology",
      "Art",
      "Humanities, Multidisciplinary; Psychology, Experimental",
      "Astronomy & Astrophysics; Physics, Particles & Fields",
      "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
      "Geriatrics & Gerontology",
      "Gerontology",
      "Management",
      "Operations Research & Management Science",
      "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
      "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"
    )

A lookup in the hash will not work for compound strings (those containing ";"). You can, however, split them using strsplit. Then you can use unique to drop repeated categories and put the pieces back together using paste.

    stringedit <- function(x) {
      # first, split into subterms
      a.all <- unlist(strsplit(x, "; *"))
      # look up each subterm, drop duplicate categories, and rejoin
      paste(unique(myhash[a.all]), collapse = "; ")
    }
    unlist(lapply(A, stringedit))

Here is the result you wanted:

     [1] "Science"                            "Science"
     [3] "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
     [5] "Science"                            "Social Sciences; Science"
     [7] "Science"                            "Social Sciences"
     [9] "Social Sciences"                    "Science"
    [11] "Science"                            "Social Sciences; Science"

Of course, you can call *apply several times like this:

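    # split every string into its subterms (strsplit is vectorised over A)
    a.spl <- strsplit(A, "; *")
    # look up the category of each subterm
    a.spl <- lapply(a.spl, function(x) myhash[x])
    # drop duplicate categories and rejoin
    sapply(a.spl, function(x) paste(unique(x), collapse = "; "))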

It is no more or less efficient than the previous code.

Yes, you can achieve the same with regular expressions, but first, it would involve splitting the strings anyway, and then you would need anchored regular expressions like ^Biology$ to make sure that they match "Biology" but not "Cell Biology", etc., unless you want to get into constructs like ".*Biology". Finally, you would still have to get rid of the duplicates, and all of this would be, in my opinion, (i) less explicit (and so more error prone) and (ii) not worth the effort.
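The effect of the anchors can be checked on two of the terms. This is only an illustrative sketch (the names terms, loose, strict and replaced are made up here), not a full regex-based solution:

```r
terms <- c("Biology", "Cell Biology")

# unanchored: the pattern is found anywhere in the string
loose <- grepl("Biology", terms)
loose
# TRUE TRUE

# anchored with ^ and $: only the exact term matches
strict <- grepl("^Biology$", terms)
strict
# TRUE FALSE

# so an anchored substitution leaves "Cell Biology" untouched
replaced <- sub("^Biology$", "Science", terms)
replaced
# "Science" "Cell Biology"
```

Note that the anchors only help once each compound string has been split into its subterms, which is why the splitting step cannot be avoided.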

+4

I find it easier to use a two-column data.frame as a lookup table, with one column for the course name and one column for the category. Here is an example:

    course.categories <- data.frame(
      Course = c("Art", "Humanities, Multidisciplinary",
                 "Biology", "Cell Biology", "Astronomy & Astrophysics",
                 "Physics, Particles & Fields", "Mathematics",
                 "Mathematics, Applied",
                 "Mathematics, Interdisciplinary Applications",
                 "Geriatrics & Gerontology",
                 "Operations Research & Management Science",
                 "Computer Science, Artificial Intelligence",
                 "Computer Science, Information Systems",
                 "Engineering, Electrical & Electronic",
                 "Statistics & Probability",
                 "Psychology, Experimental", "Economics",
                 "Social Sciences, Mathematical Methods",
                 "Gerontology", "Management"),
      Category = c("Arts & Humanities", "Arts & Humanities",
                   "Science", "Science", "Science", "Science", "Science",
                   "Science", "Science", "Science", "Science", "Science",
                   "Science", "Science", "Science",
                   "Social Sciences", "Social Sciences", "Social Sciences",
                   "Social Sciences", "Social Sciences"))

Then, with A as a list, as in your question:

    sapply(strsplit(unlist(A), "; "), function(x)
      paste(unique(course.categories[match(x, course.categories[["Course"]]),
                                     "Category"]),
            collapse = "; "))
    #  [1] "Science"                            "Science"
    #  [3] "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
    #  [5] "Science"                            "Social Sciences; Science"
    #  [7] "Science"                            "Social Sciences"
    #  [9] "Social Sciences"                    "Science"
    # [11] "Science"                            "Social Sciences; Science"

match looks up the values from A among the course names in course.categories and returns the row number of each match; this is used to retrieve the category to which each course belongs. Then unique ensures that each category appears only once, and paste joins the categories back into a single string.
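What match returns is easiest to see on a small case. The vectors below (courses, categories, x, idx, cats) are hypothetical stand-ins for the two columns of the data.frame, kept short for illustration:

```r
courses    <- c("Art", "Biology", "Economics")
categories <- c("Arts & Humanities", "Science", "Social Sciences")

x <- c("Economics", "Biology", "Botany")

# match() returns, for each element of x, its position in courses (NA if absent)
idx <- match(x, courses)
idx
# 3 2 NA

# those positions then index the parallel vector of categories
cats <- categories[idx]
cats
# "Social Sciences" "Science" NA
```

A course that is missing from the table therefore shows up as NA in the result rather than raising an error.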

+5

What about using switch?

    science.category <- function(science) {
      # an empty right-hand side falls through to the next non-empty value
      switch(science,
             "Biology" = ,
             "Cell Biology" = ,
             "Astronomy & Astrophysics" = ,
             "Physics, Particles & Fields" = ,
             "Mathematics" = ,
             "Mathematics, Applied" = ,
             "Mathematics, Interdisciplinary Applications" = ,
             "Geriatrics & Gerontology" = ,
             "Operations Research & Management Science" = ,
             "Computer Science, Artificial Intelligence" = ,
             "Computer Science, Information Systems" = ,
             "Engineering, Electrical & Electronic" = ,
             "Statistics & Probability" = "Science",
             "Art" = ,
             "Humanities, Multidisciplinary" = "Arts & Humanities",
             "Psychology, Experimental" = ,
             "Economics" = ,
             "Social Sciences, Mathematical Methods" = ,
             "Gerontology" = ,
             "Management" = "Social Sciences",
             NA  # default when nothing matches
      )
    }
    a <- unlist(lapply(A, strsplit, split = " *; *"), recursive = FALSE)
    a1 <- lapply(a, function(x) unique(sapply(x, science.category)))
    sapply(a1, paste, collapse = "; ")

Of course, this works only as long as your strings exactly match the switch cases. One mismatch, and you end up with NA. For more flexible matching, you would have to write your own wrapper around the grep family of functions, or even agrep (handle with care).
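Short of fuzzy matching, a simpler safeguard is to make mismatches fail loudly instead of silently turning into NA. The sketch below is not a grep or agrep wrapper; it is an exact lookup with an explicit check, and the names lookup.strict and small.table are hypothetical.

```r
# hypothetical helper: look terms up in a named vector and stop on misses
lookup.strict <- function(terms, table) {
  terms <- trimws(terms)                 # tolerate stray whitespace
  res   <- unname(table[terms])
  bad   <- unique(terms[is.na(res)])
  if (length(bad) > 0) {
    stop("no category for: ", paste(bad, collapse = ", "))
  }
  res
}

# shortened lookup table for illustration only
small.table <- c("Biology" = "Science", "Art" = "Arts & Humanities")

ok <- lookup.strict(c(" Biology", "Art "), small.table)
ok
# "Science" "Arts & Humanities"

# a misspelled term raises an error instead of silently becoming NA
err <- tryCatch(lookup.strict("Biolgy", small.table),
                error = function(e) conditionMessage(e))
err
# "no category for: Biolgy"
```

Whether an error, a warning, or a returned list of unmatched terms is most useful depends on how the data are cleaned afterwards.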

+2
