Extracting the part of a string that starts with a 4-digit number and ends with a period

I have a character vector, for example:

char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv") 

I want to extract the part of each string that starts with the "7" of the 4-digit patient identifier and ends at the ".". However, the following method does not work when another 7 appears before the patient identifier:

 values <- unlist(qdapRegex::rm_between(char, "7", ".", extract = TRUE)) 

How can I specify that the match should start with the 7 that begins a 4-digit number?

regex r
2 answers

You can use this:

 char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv")
 gsub(".*(7\\d{3}.*)\\..*$", "\\1", char)
 ## [1] "7575_tumor" "7363_healthy"
  • It looks for a 7 followed by three digits, which together make a four-digit number: 7\\d{3}
  • It captures everything from that 7 up to the last . in the string (these file names contain only one): (7\\d{3}.*)\\.
  • It then replaces the whole string with the captured group: \\1
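
The steps above can be checked with a small sketch. The file name in `extra` is hypothetical (it is not in the question) and has two periods, to show where the capture group stops. `perl = TRUE` is an assumption added here so that the pattern runs with PCRE matching; the answer itself uses the default engine.

```r
char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv")

# Hypothetical file name with two periods, used only for illustration.
extra <- "cancer_6_53_7575_tumor.v2.csv"

# Same pattern as in the answer above.
pattern <- ".*(7\\d{3}.*)\\..*$"

# The pattern consumes the whole string, so each string is replaced
# by the contents of capture group 1.
res_char  <- gsub(pattern, "\\1", char, perl = TRUE)
res_extra <- gsub(pattern, "\\1", extra, perl = TRUE)

print(res_char)   # "7575_tumor" "7363_healthy"
print(res_extra)  # "7575_tumor.v2" (the greedy .* runs to the last period)
```

If names with several periods are possible and only the part before the first period is wanted, replacing the inner `.*` with `[^.]*` would be one way to stop earlier.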

Another way is to use stringr:

 library(stringr)
 str_extract(char, '7\\d{3}[^\\.]*')
 ## [1] "7575_tumor" "7363_healthy"

This matches four digits starting with 7, followed by everything up to (but not including) the first period.
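
For comparison, here is a minimal base R sketch of the same idea that needs no extra package. The lookarounds `(?<!\\d)` and `(?!\\d)` are an addition not found in either answer: they assume the identifier should be exactly four digits and not part of a longer number. The variable names are illustrative only.

```r
char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv")

# 7 followed by three digits, not preceded or followed by another digit,
# then every character up to (not including) the first period.
pattern <- "(?<!\\d)7\\d{3}(?!\\d)[^.]*"

# regexpr() locates the first match in each string (lookarounds need perl = TRUE);
# regmatches() extracts the matched text.
m <- regexpr(pattern, char, perl = TRUE)
values <- regmatches(char, m)

print(values)  # "7575_tumor" "7363_healthy"
```

Note that `regmatches()` silently drops elements with no match, so the result can be shorter than the input; `str_extract()` returns `NA` for those elements instead.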
