Group capture in R with multiple capture groups

In R, is it possible to extract the capture groups from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub returns the captured groups.

I need to extract key-value pairs from strings that are encoded this way:

 \((.*?) :: (0\.[0-9]+)\) 

I can always run multiple grep calls with a complete match or do some external (non-R) processing, but I was hoping I could do it all inside R. Is there a function or package that does this?

+62
regex r capture
Jun 04 '09 at 18:25
8 answers

str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the entire match):

    > library(stringr)
    > s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
    > str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
         [,1]                         [,2]       [,3]
    [1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
    [2,] "(moretext :: 0.111222)"     "moretext" "0.111222"

+79
Apr 6 '12 at 3:13

gsub does this from your example:

    > gsub("\\((.*?) :: (0\\.[0-9]+)\\)", "\\1 \\2", "(sometext :: 0.1231313213)")
    [1] "sometext 0.1231313213"

You need to double-escape the backslashes inside the quoted string; R consumes one level of escaping, and the regex engine then sees the single backslash it expects.

Hope this helps.

+33
Jun 04 '09 at 22:44

Try regmatches() and regexec():

    > regmatches("(sometext :: 0.1231313213)",
    +            regexec("\\((.*?) :: (0\\.[0-9]+)\\)", "(sometext :: 0.1231313213)"))
    [[1]]
    [1] "(sometext :: 0.1231313213)" "sometext"                   "0.1231313213"

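
Because regexec() is vectorized over its input, the same idea extends to a whole character vector. The following is a minimal sketch, not part of the original answer: the variable names (`s`, `pattern`, `pairs`, `result`) and the extra non-matching string are illustrative assumptions. It collects the two groups into a data frame and leaves NA where a string does not match.

```r
# Illustrative input: two matching strings and one that does not match.
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)", "no pair here")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# One list element per input string: full match, group 1, group 2.
# A non-matching string yields character(0).
m <- regmatches(s, regexec(pattern, s))

# Pick elements 2 and 3 (the capture groups); indexing past the end of
# character(0) gives NA, so non-matches become NA rows.
pairs <- do.call(rbind, lapply(m, function(x) x[2:3]))

result <- data.frame(
  key = pairs[, 1],
  value = as.numeric(pairs[, 2]),
  stringsAsFactors = FALSE
)
print(result)
```

Keeping one row per input string, rather than dropping non-matches, makes it easy to line the result up with the original data.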
+15
May 15 '13 at 11:32

gsub() can do this and return only the capture groups.

However, for this to work, your pattern must explicitly match the text outside your capture groups as well, as indicated in the gsub() help:

(...) elements of the character vectors "x" that are not replaced will be returned unchanged.

So, if the text you want sits in the middle of a longer string, adding .* before and after the capture groups lets you return only the captured text.

    > gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*", "\\1 \\2", "(sometext :: 0.1231313213)")
    [1] "sometext 0.1231313213"
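
To make the effect of the surrounding .* concrete, here is a small sketch comparing the two patterns on a string with extra text around the pair. The input strings and variable names are made up for illustration.

```r
# Illustrative input with text before and after the key-value pair.
x <- "prefix (sometext :: 0.12) suffix"

bare    <- "\\((.*?) :: (0\\.[0-9]+)\\)"
wrapped <- paste0(".*", bare, ".*")

# Without .* only the matched part is replaced; the rest of the string stays.
without_wrap <- gsub(bare, "\\1 \\2", x)

# With .* on both sides the whole string is matched and replaced by the groups.
with_wrap <- gsub(wrapped, "\\1 \\2", x)

# A string that does not match at all is returned unchanged.
unmatched <- gsub(wrapped, "\\1 \\2", "nothing to see")

print(without_wrap)  # "prefix sometext 0.12 suffix"
print(with_wrap)     # "sometext 0.12"
print(unmatched)     # "nothing to see"
```

The last case is worth checking in real data: a non-matching input is passed through silently, so it can be mistaken for an extracted value.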

+13
Apr 26 2018-11-21T00:

I like regular expressions compatible with perl. Maybe someone else too ...

Here is a function that runs a Perl-compatible regular expression and mirrors the behavior of the matching functions in other languages that I'm used to:

    regexpr_perl <- function(expr, str) {
      match <- regexpr(expr, str, perl = TRUE)
      matches <- character(0)
      if (attr(match, "match.length") >= 0) {
        capture_start <- attr(match, "capture.start")
        capture_length <- attr(match, "capture.length")
        total_matches <- 1 + length(capture_start)
        matches <- character(total_matches)
        # Element 1 is the full match; the remaining elements are the capture groups.
        matches[1] <- substr(str, match, match + attr(match, "match.length") - 1)
        # seq_along() also handles patterns with exactly one (or zero) capture groups.
        for (i in seq_along(capture_start)) {
          matches[i + 1] <- substr(str, capture_start[[i]],
                                   capture_start[[i]] + capture_length[[i]] - 1)
        }
      }
      matches
    }

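
With perl = TRUE, regexpr() also records group names in a "capture.names" attribute, so named groups can label the result. The sketch below is a variation on the idea above, not the answer's own function: the helper name `first_captures` and the group names `key` and `value` are hypothetical.

```r
# Hypothetical helper: return the capture groups of the first match in a
# single string as a named character vector (empty if there is no match).
first_captures <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m[1] == -1) return(character(0))
  # capture.start / capture.length are matrices with one row per input string.
  starts  <- attr(m, "capture.start")[1, ]
  lengths <- attr(m, "capture.length")[1, ]
  out <- substring(text, starts, starts + lengths - 1)
  names(out) <- attr(m, "capture.names")
  out
}

# Named groups use the PCRE syntax (?<name>...).
kv <- first_captures("\\((?<key>.*?) :: (?<value>0\\.[0-9]+)\\)",
                     "(sometext :: 0.1231313213)")
print(kv[["key"]])
print(kv[["value"]])
```

Looking groups up by name keeps the calling code readable when the pattern later gains or reorders groups.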
+3
Jan 29 '15 at 16:53

This is how I ended up working around this issue. I used two separate regular expressions, one for the first capture group and one for the second, ran gregexpr twice, then pulled out the matched substrings:

    # str is the input string containing the key-value pairs.
    regex.string <- "(?<=\\().*?(?= :: )"
    regex.number <- "(?<= :: )\\d\\.\\d+"
    match.string <- gregexpr(regex.string, str, perl = TRUE)[[1]]
    match.number <- gregexpr(regex.number, str, perl = TRUE)[[1]]
    strings <- mapply(function(start, len) substr(str, start, start + len - 1),
                      match.string,
                      attr(match.string, "match.length"))
    numbers <- mapply(function(start, len) as.numeric(substr(str, start, start + len - 1)),
                      match.number,
                      attr(match.number, "match.length"))

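
An alternative that keeps a single pattern is to find every full match with gregexpr() first and then re-match each hit with regexec() to split it into groups. This is a rough sketch under assumed inputs: the sample string and the names `input`, `full`, `groups`, `keys`, and `values` are illustrative only.

```r
# Illustrative input: two key-value pairs inside one string.
input <- "(alpha :: 0.5) noise (beta :: 0.25)"
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# Step 1: every full match in the string.
full <- regmatches(input, gregexpr(pattern, input))[[1]]

# Step 2: re-match each hit to split it into full match, group 1, group 2.
groups <- regmatches(full, regexec(pattern, full))

keys   <- vapply(groups, function(g) g[2], character(1))
values <- as.numeric(vapply(groups, function(g) g[3], character(1)))

print(keys)
print(values)
```

Matching twice costs a little extra work, but the key and value of each pair stay together, so the two result vectors cannot drift out of step the way two independent searches might.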
+2
Jun 05 '09 at 16:06

As the stringr package documentation shows, this can be achieved using str_match() or str_extract().

Adapted from the manual:

    library(stringr)
    strings <- c(" 219 733 8965", "329-293-8753 ", "banana",
                 "239 923 8115 and 842 566 4692", "Work: 579-499-7527",
                 "$1000", "Home: 543.355.3679")
    phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

Extracting the whole match, with the groups combined:

    str_extract(strings, phone)
    # [1] "219 733 8965" "329-293-8753" NA             "239 923 8115" "579-499-7527" NA
    # [7] "543.355.3679"

Extracting each group into an output matrix (we are interested in columns 2 onward):

    str_match(strings, phone)
    #      [,1]           [,2]  [,3]  [,4]
    # [1,] "219 733 8965" "219" "733" "8965"
    # [2,] "329-293-8753" "329" "293" "8753"
    # [3,] NA             NA    NA    NA
    # [4,] "239 923 8115" "239" "923" "8115"
    # [5,] "579-499-7527" "579" "499" "7527"
    # [6,] NA             NA    NA    NA
    # [7,] "543.355.3679" "543" "355" "3679"

0
Dec 23 '15 at 15:37

Solution with strcapture from utils:

    x <- c("key1 :: 0.01", "key2 :: 0.02")
    strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
               x = x,
               proto = list(key = character(), value = double()))
    #>    key value
    #> 1 key1  0.01
    #> 2 key2  0.02

0
Aug 24 '17 at 1:22 on


