Regex matching inside dplyr

While answering this question, I wrote the following code:

df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))

require(stringr)

matches = str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter=matches[,2], number=matches[,3])

Now my question is: is there an easy way to combine the last two lines into a single dplyr call, presumably using mutate()? As an alternative, I would also be interested in a solution that uses do(). For the mutate() approach, since we are extracting two groups, I will accept a solution that calls str_match() twice with different regular expressions, one for each desired group.

Edit: To clarify, the main problem I see here is that str_match() returns a matrix, and I am wondering how to handle that inside mutate() or do(). I am not interested in solving the original problem with other methods of extracting the information. Many such solutions have already been given here.
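To make the matrix issue concrete, here is a minimal sketch (not part of the original question) using two shortened call numbers as assumed sample input. It only shows the shape of what str_match() returns: one row per input string, with the full match in column 1 and the capture groups in columns 2 and 3.

```r
library(stringr)

# Shortened sample inputs, assumed for illustration only
x <- c("HV5822.H4 C47", "D753 .F4")
m <- str_match(x, "([A-Z]+)(\\d+)\\s*\\.")

is.matrix(m)  # the result is a character matrix, not a vector
dim(m)        # rows = input strings, columns = full match + 2 groups
m[, 2]        # first capture group (the letters)
m[, 3]        # second capture group (the digits)
```

Because mutate() expects one vector per new column, the matrix has to be indexed by column (or converted to a data frame) before it can be attached to df.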

2 answers

You can try this with do():

df %>% 
  do(data.frame(., str_match(.$Call_Num,  "([A-Z]+)(\\d+)\\s*\\.")[,-1],
                              stringsAsFactors=FALSE)) %>%
  rename_(.dots=setNames(names(.)[-1],c('letter', 'number')))
#                                             Call_Num letter number
#1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
#2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
#3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
#4          D753 .F4 Circulating Collection, 3rd Floor      D    753
#5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

Or, as @SamFirke commented, you can also rename columns using

  ---                                    %>%
 setNames(., c(names(.)[1], "letter", "number"))
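For comparison, the mutate() route mentioned in the question could look roughly like the sketch below. It is not from the original answers. It calls str_match() once per group with the same pattern and picks the relevant matrix column each time. The two-row df and the name pattern are assumptions made to keep the example short.

```r
library(dplyr)
library(stringr)

# Two rows of the question's data, kept short for illustration
df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

pattern <- "([A-Z]+)(\\d+)\\s*\\."

# Each str_match() call returns a matrix; [, 2] and [, 3] select
# the first and second capture group as plain character vectors
df2 <- df %>%
  mutate(letter = str_match(Call_Num, pattern)[, 2],
         number = str_match(Call_Num, pattern)[, 3])
```

This runs the regular expression twice, which the question explicitly allows. The do() version above avoids the second pass at the cost of the extra renaming step.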

You can do this using extract() from the tidyr package:

extract(df, Call_Num, into = c("letter", "number"), regex = "([A-Z]+)(\\d+)\\s*\\.", remove = FALSE)

                                             Call_Num letter number
1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
4          D753 .F4 Circulating Collection, 3rd Floor      D    753
5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

This is not dplyr itself, but according to its description on CRAN, tidyr is designed to work well with dplyr data pipelines.
