Extract substring in R by pattern

Suppose I have a vector of strings: string = c("G1:E001", "G2:E002", "G3:E003"). I would like to get a character vector that contains only the parts after the colon ":", that is, substring = c("E001", "E002", "E003"). Is there a convenient way to do this in R? Using substr? Thank you.

+107
regex r substr
Jun 20 '13 at 14:06
7 answers

Here are some ways:

1) sub

    sub(".*:", "", string)
    ## [1] "E001" "E002" "E003"
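
A caveat worth sketching: the pattern ".*:" is greedy, so it removes everything up to the last colon. The input x below is a hypothetical vector (not from the question) with two colons per element, used to contrast this with a pattern that stops at the first colon.

```r
# Hypothetical input with more than one colon per element (an assumption,
# the question's data has exactly one colon per element)
x <- c("G1:E001:a", "G2:E002:b")

# Greedy ".*:" matches up to the LAST colon, so only the final piece is kept
after_last <- sub(".*:", "", x)

# "^[^:]*:" matches from the start up to the FIRST colon only
after_first <- sub("^[^:]*:", "", x)

print(after_last)
print(after_first)
```

With exactly one colon per element, as in the question, both patterns give the same result.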

2) strsplit

    sapply(strsplit(string, ":"), "[", 2)
    ## [1] "E001" "E002" "E003"

3) read.table

    read.table(text = string, sep = ":", as.is = TRUE)$V2
    ## [1] "E001" "E002" "E003"

4) substring

This assumes that the second part always starts with the 4th character (which is the case in the example in the question):

    substring(string, 4)
    ## [1] "E001" "E002" "E003"

4a) substring / regular expression

If the colon were not always in a known position, we could change (4) by searching for it:

 substring(string, regexpr(":", string) + 1) 
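
As a rough illustration of (4a), the sketch below uses a hypothetical vector x (not from the question) whose prefixes have different lengths, so the fixed offset in (4) would no longer work.

```r
# Hypothetical input where the colon sits at a different position in each element
x <- c("G1:E001", "G10:E002", "Group3:E003")

# regexpr() returns the position of the first ":" in each element
pos <- regexpr(":", x)

# substring() with only a start index runs to the end of each string
res <- substring(x, pos + 1)
print(res)
```

Note that for an element containing no colon, regexpr() returns -1, so the start index becomes 0 and the whole string comes back unchanged.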

5) strapplyc

strapplyc returns the parenthesized portion:

    library(gsubfn)
    strapplyc(string, ":(.*)", simplify = TRUE)
    ## [1] "E001" "E002" "E003"

6) read.dcf

This one works only if the substrings before the colon are unique (which they are in the example in the question). It also requires the separator to be a colon (which it is in the question). If a different separator were used, we could use sub to replace it with a colon first. For example, if the separator were _ then string <- sub("_", ":", string).

    c(read.dcf(textConnection(string)))
    ## [1] "E001" "E002" "E003"
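
A minimal sketch of the separator replacement described in (6), assuming a hypothetical vector x that uses "_" instead of ":" (the names x, x_colon, con and m are illustrative only):

```r
# Hypothetical input using "_" as the separator instead of ":"
x <- c("G1_E001", "G2_E002", "G3_E003")

# Replace the first "_" in each element with ":" so that every line
# looks like a DCF "tag:value" field
x_colon <- sub("_", ":", x)

# read.dcf() reads the lines as one record and returns a character matrix
# whose columns are named by the (unique) tags before the colon
con <- textConnection(x_colon)
m <- read.dcf(con)
close(con)

# c() drops the matrix dimensions and column names
res <- c(m)
print(res)
```

Because sub replaces only the first match, any further "_" after the separator is left in the value part.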

7) tidyr::separate

Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for the part after, and then extract the latter.

    library(dplyr)
    library(tidyr)
    DF <- data.frame(string)
    DF %>%
      separate(string, into = c("pre", "post")) %>%
      pull("post")
    ## [1] "E001" "E002" "E003"

7a) Alternatively, separate can be used to create just the post column, and then we unlist and unname the resulting data frame:

    library(dplyr)
    library(tidyr)
    DF %>%
      separate(string, into = c(NA, "post")) %>%
      unlist %>%
      unname
    ## [1] "E001" "E002" "E003"

ADDED: the strapplyc, read.dcf and separate solutions.

NOTE.

The input string is assumed to be:

 string <- c("G1:E001", "G2:E002", "G3:E003") 
+177
Jun 20 '13 at 14:10

For example, using gsub or sub:

    gsub('.*:(.*)', '\\1', string)
    [1] "E001" "E002" "E003"
+22
Jun 20 '13 at 14:10

Here is another simple answer

 gsub("^.*:","", string) 
+9
Apr 21 '14 at 19:49

Late to the party, but for posterity: the stringr package (part of the popular tidyverse collection of packages) now provides functions with consistent signatures for string handling:

    string <- c("G1:E001", "G2:E002", "G3:E003")
    stringr::str_extract(string = string, pattern = "E[0-9]+")
    # [1] "E001" "E002" "E003"
+6
Oct 02 '18 at 12:47

This should do:

    gsub("[A-Z][1-9]:", "", string)

gives

 [1] "E001" "E002" "E003" 
+4
Jun 20 '13 at 14:10

If you use data.table, then tstrsplit() is the natural choice:

    library(data.table)
    tstrsplit(string, ":")[[2]]
    [1] "E001" "E002" "E003"
+1
Oct 02 '18 at 12:51

I have a related question. How do you extract the substring from the beginning of a string up to the second occurrence of a comma?

0
Jul 03 '19 at 16:54