Lookahead and lookbehind for regex in R

Question

I am trying to use regular expressions with the stringr package to extract some text. For some reason, I get the error "Invalid regexp". I tried the regular expression in some online regex testers and it seems to work there. I was wondering if there is something unique about how regular expressions work in R, and especially in the stringr package.

Here is an example:

    library(stringr)

    string <- c("MARKETING: Vice President", "FINANCE: Accountant I", "OPERATIONS: Plant Manager")
    pattern <- "[A-Z]+(?=:)"
    test <- gsub(" ", "", string)
    results <- str_extract(test, pattern)

This does not work. I would like to get "MARKETING", "FINANCE" and "OPERATIONS" without the ":" in them, which is why I use the lookahead syntax. I understand that I can work around this using:

    pattern <- "[A-Z]+(:)"
    test <- gsub(" ", "", string)
    results <- gsub(":", "", str_extract(test, pattern))

But I expect that I may need lookarounds for more complex situations than this in the near future.

Do I need to modify the regex with some escapes or something to make this work?

+4

regex r

exl Jan 03 '13 at 15:34

2 answers

You can do this directly with sub and grouping.

    sub('^([A-Z]+):.*$', '\\1', string)
    # [1] "MARKETING" "FINANCE" "OPERATIONS"

Here I anchor the group to the beginning of the string, match one or more capital letters, and capture them. They must be followed by a colon, :, and then zero or more additional characters. The replacement '\\1' keeps only the captured group.
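One caveat with the sub approach is that an element with no match is returned unchanged rather than as NA. The sketch below shows one way to flag such elements in base R; the vector mixed and the NA handling are illustrative assumptions, not part of the original answer.

```r
# Hypothetical input: the last element has no "LABEL:" prefix.
mixed <- c("MARKETING: Vice President", "FINANCE: Accountant I", "no label here")

pattern <- "^([A-Z]+):.*$"

# sub() leaves non-matching elements untouched.
labels <- sub(pattern, "\\1", mixed)

# Flag the non-matches explicitly so they are not mistaken for labels.
labels[!grepl(pattern, mixed)] <- NA_character_
labels
```

Whether NA or the untouched string is the better result depends on what the downstream code expects.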

+2

Justin Jan 03 '13 at 15:38

Matthew plourde · Accepted Answer · 2013-01-03T15:52:22+0000

Lookahead assertions require that you mark the pattern as a Perl-compatible regular expression in R.

    str_extract(string, perl(pattern))
    # [1] "MARKETING" "FINANCE" "OPERATIONS"

You can also do this easily in base R:

    regmatches(string, regexpr(pattern, string, perl=TRUE))
    # [1] "MARKETING" "FINANCE" "OPERATIONS"

regexpr finds the first match in each element, and regmatches uses that match data to extract the substrings.
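To make the two steps more concrete, here is a minimal base R sketch that inspects what regexpr returns and adds a lookbehind counterpart to the lookahead. The lookbehind pattern and the variable names m, departments and titles are illustrative assumptions, not part of the original answer.

```r
string <- c("MARKETING: Vice President", "FINANCE: Accountant I", "OPERATIONS: Plant Manager")

# Lookahead: capital letters that are followed by a colon; the colon is not consumed.
m <- regexpr("[A-Z]+(?=:)", string, perl = TRUE)

# regexpr() returns the start position of each match, with the match
# lengths stored in the "match.length" attribute.
starts  <- as.integer(m)
lengths <- attr(m, "match.length")

# regmatches() uses those positions and lengths to cut out the substrings.
departments <- regmatches(string, m)

# Lookbehind: everything that is preceded by a colon and a space.
titles <- regmatches(string, regexpr("(?<=: ).*", string, perl = TRUE))

departments
titles
```

Note that regmatches combined with regexpr silently drops elements that have no match, so the result can be shorter than the input vector.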
