Extract substring in R by pattern

Question

Suppose I have a vector of strings: string = c("G1:E001", "G2:E002", "G3:E003"). I would like to get a vector that contains only the parts after the colon ":", that is, substring = c("E001", "E002", "E003"). Is there a convenient way to do this in R? Using substr? Thank you.

+107

regex r substr

alittleboy Jun 20 '13 at 14:06

7 answers

For example, using gsub or sub:

 gsub('.*:(.*)', '\\1', string)
 [1] "E001" "E002" "E003"

+22

agstudy Jun 20 '13 at 14:10

Here is another simple answer

 gsub("^.*:","", string)

+9

Ragy Isaac Apr 21 '14 at 19:49

Late to the party, but for posterity: the stringr package (part of the popular tidyverse suite of packages) now provides functions with consistent signatures for processing strings:

 string <- c("G1:E001", "G2:E002", "G3:E003")
 stringr::str_extract(string = string, pattern = "E[0-9]+")
 # [1] "E001" "E002" "E003"

+6

CSJCampbell Oct 02 '18 at 12:47

This should do:

 gsub("[A-Z][1-9]:", "", string)

gives

 [1] "E001" "E002" "E003"

+4

user1981275 Jun 20 '13 at 14:10

If you use data.table then tstrsplit() is the natural choice:

 tstrsplit(string, ":")[[2]]
 [1] "E001" "E002" "E003"

+1

sindri_baldur Oct 02 '18 at 12:51

I have a related question. How do you extract the part of a string from its beginning up to the second occurrence of a comma?

0

Susan Gottfried Jul 03 '19 at 16:54

G. Grothendieck · Accepted Answer · 2013-06-20 14:10

Here are some ways:

1) sub

 sub(".*:", "", string)
 ## [1] "E001" "E002" "E003"

2) strsplit

 sapply(strsplit(string, ":"), "[", 2)
 ## [1] "E001" "E002" "E003"

3) read.table

 read.table(text = string, sep = ":", as.is = TRUE)$V2
 ## [1] "E001" "E002" "E003"

4) substring

This assumes that the second part always starts with the 4th character (which is the case in the example in the question):

 substring(string, 4)
 ## [1] "E001" "E002" "E003"

4a) substring / regular expression

If the colon were not always in a known position, we could change (4) by searching for it:

 substring(string, regexpr(":", string) + 1)
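To see why searching for the colon helps, here is a minimal sketch on a hypothetical input (not from the question) in which the colon sits at a different position in each element. The variable names x, pos and after are only illustrative.

```r
# Hypothetical input: the prefix before the colon varies in length
x <- c("G1:E001", "G10:E002", "Gene3:E003")

# regexpr() gives the position of the first colon in each element;
# fixed = TRUE treats ":" as a literal string rather than a regex
pos <- regexpr(":", x, fixed = TRUE)

# Keep everything from the character after the colon onwards
after <- substring(x, pos + 1)
print(after)
## [1] "E001" "E002" "E003"
```

Note that regexpr() returns -1 for an element with no colon, so such an element would come back unchanged rather than raise an error.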

5) strapplyc

strapplyc returns the part in parentheses:

 library(gsubfn)
 strapplyc(string, ":(.*)", simplify = TRUE)
 ## [1] "E001" "E002" "E003"
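If installing gsubfn is not an option, the same "return the parenthesised part" idea can be sketched in base R with regexec and regmatches. This is a substitute technique shown for comparison, not how strapplyc works internally; the names m and captured are illustrative.

```r
string <- c("G1:E001", "G2:E002", "G3:E003")

# regexec() records the full match and each capture group;
# regmatches() turns that into one character vector per element:
# element 1 is the full match, element 2 is the first capture group
m <- regmatches(string, regexec(":(.*)", string))

# Pick the first capture group from each element
captured <- vapply(m, `[`, character(1), 2)
print(captured)
## [1] "E001" "E002" "E003"
```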

6) read.dcf

This one works only if the substrings before the colon are unique (which they are in the example in the question). It also requires the separator to be a colon (as it is in the question). If a different separator were used, we could use sub to replace it with a colon first. For example, if the separator were _ then string <- sub("_", ":", string)

 c(read.dcf(textConnection(string)))
 ## [1] "E001" "E002" "E003"
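The delimiter swap mentioned for this approach might look like the following sketch. The underscore-separated input s is hypothetical, and the connection is closed explicitly here, which the one-liner above leaves to R.

```r
# Hypothetical input that uses "_" instead of ":" as the separator
s <- c("G1_E001", "G2_E002", "G3_E003")

# Replace the first underscore in each element with a colon so that
# every element looks like a DCF "field:value" line
s2 <- sub("_", ":", s, fixed = TRUE)

# read.dcf() returns a one-row character matrix with one column per
# field; c() drops the dimensions and leaves a plain vector
con <- textConnection(s2)
res <- c(read.dcf(con))
close(con)

print(res)
```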

7) tidyr::separate

Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for the part after, and then we extract the latter.

 library(dplyr)
 library(tidyr)
 library(purrr)

 DF <- data.frame(string)
 DF %>%
   separate(string, into = c("pre", "post")) %>%
   pull("post")
 ## [1] "E001" "E002" "E003"

7a) Alternatively, separate can be used to create just the post column, after which we unlist and unname the resulting data frame:

 library(dplyr)
 library(tidyr)

 DF %>%
   separate(string, into = c(NA, "post")) %>%
   unlist %>%
   unname
 ## [1] "E001" "E002" "E003"

ADDED: the strapplyc, read.dcf and separate solutions.

NOTE.

The input string is assumed to be:

 string <- c("G1:E001", "G2:E002", "G3:E003")
