Split a string when an uppercase letter follows a lowercase letter in the middle of a word in R

I have some strings that were concatenated together and that I would like to split apart again. I deal with things like

name="on-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol" 

which in this case should be divided into "on-Butylhydroxylamine", "1-Methylpropylhydroxylamine" and "Amino-2-butanol"

Any thoughts on how I could use a regular expression with strsplit and/or gsub to achieve this? The rule I would like to use is to split the word wherever a number, an opening bracket ("("), or an uppercase letter follows a lowercase letter.

+6
3 answers

You could use positive lookbehind and lookahead assertions to find (and then split at) the inter-character positions that are preceded by a lowercase letter and followed by an uppercase letter, a digit, or "(".

 name <- "on-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
 pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
 strsplit(name, pat, perl=TRUE)
 # [[1]]
 # [1] "on-Butylhydroxylamine"       "1-Methylpropylhydroxylamine"
 # [3] "Amino-2-butanol"
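
As a quick check that the same pattern also covers the bracket case from the question, here is a minimal sketch that wraps the call in a helper. The function name split_compounds and the test string are made up for illustration and do not come from the original post.

```r
# Hypothetical helper (the name is illustrative only): split at positions that
# are preceded by a lowercase letter and followed by an uppercase letter,
# a digit, or "(".
split_compounds <- function(x) {
  pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
  strsplit(x, pat, perl = TRUE)
}

# Invented input that exercises the "(" rule.
res <- split_compounds("ethanol(2-Chloroethyl)amine")
print(res[[1]])
```

Because both conditions are zero-width assertions, no characters are consumed at the split point, so the pieces paste back together into the original string.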
+9

 strsplit(name, "(?<=([a-z]))(?=[A-Z]|[0-9]|\\()", perl=TRUE)
 # [[1]]
 # [1] "on-Butylhydroxylamine"       "1-Methylpropylhydroxylamine" "Amino-2-butanol"

Remember that the return value is a list, so use [[1]] if necessary.
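
To illustrate the point about the list return value, here is a small sketch using a two-element character vector. The second string is invented for the example, and the variable names are arbitrary.

```r
pat <- "(?<=[a-z])(?=[A-Z]|[0-9]|\\()"

names_vec <- c(
  "on-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol",
  "tolueneBenzene"  # invented second input
)

# strsplit returns a list with one character vector per input string.
parts <- strsplit(names_vec, pat, perl = TRUE)

first_parts <- parts[[1]]   # pieces of the first string only
all_parts <- unlist(parts)  # all pieces flattened into one character vector
print(all_parts)
```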

+3

Try the following:

 name = "on-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
 # Insert "#" at each split point, then split on "#".
 print(strsplit(gsub("([a-z])(\\d)", "\\1#\\2",
                     gsub("([a-z])([A-Z])", "\\1#\\2", name)), "#")[[1]])

This assumes that a split occurs wherever a lowercase letter is followed by a digit or by an uppercase letter. It does not handle the "(" case from the question, and it relies on "#" not appearing in the input.
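
If the bracket rule from the question is needed as well, the two nested gsub calls could probably be collapsed into one, roughly as sketched below. This is a variation rather than the answerer's code, and it assumes the separator "#" never occurs in the input.

```r
name <- "on-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
sep <- "#"  # assumption: this character is absent from the input

# Mark every lowercase letter followed by an uppercase letter, digit, or "(".
marked <- gsub("([[:lower:]])([[:upper:][:digit:](])",
               paste0("\\1", sep, "\\2"), name)

# Split on the literal separator.
parts <- strsplit(marked, sep, fixed = TRUE)[[1]]
print(parts)
```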

+2
