Creating new variables based on specific values

Question

I have read about regular expressions and Hadley Wickham's stringr and dplyr packages, but I can't figure out how to make this work.

I have library access data in a data frame in which the call number is a character variable. I would like to extract the initial capital letters into one new variable, and the digits between those letters and the period into a second new variable.

    Call_Num
    HV5822.H4 C47 Circulating Collection, 3rd Floor
    QE511.4 .G53 1982 Circulating Collection, 3rd Floor
    TL515 .M63 Circulating Collection, 3rd Floor
    D753 .F4 Circulating Collection, 3rd Floor
    DB89.F7 D4 Circulating Collection, 3rd Floor

+5

regex r dplyr stringr

Concept delta Jul 07 '15 at 4:13

4 answers

What about

    # Read the sample call numbers into a one-column data frame
    rl <- read.table(header = TRUE, text = "Call_Num
    'HV5822.H4 C47 Circulating Collection, 3rd Floor'
    'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
    'TL515 .M63 Circulating Collection, 3rd Floor'
    'D753 .F4 Circulating Collection, 3rd Floor'
    'DB89.F7 D4 Circulating Collection, 3rd Floor'", stringsAsFactors = FALSE)

    # Reduce each string to "LETTERS DIGITS", then parse that into two columns
    cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))

    #                                              Call_Num V1   V2
    # 1     HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
    # 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE  511
    # 3        TL515 .M63 Circulating Collection, 3rd Floor TL  515
    # 4          D753 .F4 Circulating Collection, 3rd Floor  D  753
    # 5        DB89.F7 D4 Circulating Collection, 3rd Floor DB   89

+2

rawr Jul 07 '15 at 4:41

If you want to use stringr, the solution will probably look something like this:

    df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
                                  "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
                                  "TL515 .M63 Circulating Collection, 3rd Floor",
                                  "D753 .F4 Circulating Collection, 3rd Floor",
                                  "DB89.F7 D4 Circulating Collection, 3rd Floor"))

    library(stringr)

    # Column 1 of the result is the full match; columns 2 and 3 are the capture groups
    matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
    df2 <- data.frame(df, letter = matches[, 2], number = matches[, 3])
    df2

    ##                                              Call_Num letter number
    ## 1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
    ## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
    ## 3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
    ## 4          D753 .F4 Circulating Collection, 3rd Floor      D    753
    ## 5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

I don't think wrapping the str_match() call in dplyr's mutate() is worth it, so I would just leave it at that. Or use rawr's solution.
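For readers who still want the dplyr form, here is a minimal sketch of what it could look like. It assumes both dplyr and stringr are installed, rebuilds the sample data so that it runs on its own, and anchors the pattern at the start of the string, which is a small variation on the pattern above. The names `pattern` and `df2` are illustrative only.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Leading capital letters, then the digits that follow them
pattern <- "^([A-Z]+)(\\d+)"

df2 <- df %>%
  mutate(
    letter = str_match(Call_Num, pattern)[, 2],  # first capture group
    number = str_match(Call_Num, pattern)[, 3]   # second capture group
  )

df2[, c("letter", "number")]
```

The pattern is matched twice here, which is part of why computing the match matrix once outside mutate(), as in the answer above, is arguably tidier.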

+2

Claus wilke Jul 07 '15 at 4:53

You can use strapply() from the gsubfn package:

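    library(gsubfn)

    # The formula receives the two capture groups as x and y;
    # simplify = rbind stacks the per-string results into a matrix
    m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)',
                  ~ c(id = x, num = y), simplify = rbind)
    X <- as.data.frame(m, stringsAsFactors = FALSE)
    X

    #   id  num
    # 1 HV 5822
    # 2 QE  511
    # 3 TL  515
    # 4  D  753
    # 5 DB   89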

+2

hwnd Jul 07 '15 at 5:46

jazzurro · Accepted Answer · 2015-07-07T04:57:57+0000

Using the stringi package is one option. Since your targets sit at the beginning of the strings, stri_extract_first() will work just fine. The pattern [:alpha:]{1,} matches a run of one or more letters, so stri_extract_first() returns the first such run in each string. Similarly, you can extract the first run of digits with stri_extract_first(x, regex = "\\d{1,}").

    x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
           "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
           "TL515 .M63 Circulating Collection, 3rd Floor",
           "D753 .F4 Circulating Collection, 3rd Floor",
           "DB89.F7 D4 Circulating Collection, 3rd Floor")

    library(stringi)

    data.frame(alpha = stri_extract_first(x, regex = "[:alpha:]{1,}"),
               number = stri_extract_first(x, regex = "\\d{1,}"))

    #  alpha number
    #1    HV   5822
    #2    QE    511
    #3    TL    515
    #4     D    753
    #5    DB     89
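For comparison, the same "first run of letters, first run of digits" idea can be sketched in base R without any extra packages. This is not part of the answer above. It swaps stri_extract_first() for regexpr() plus regmatches(), and the variable names `alpha`, `number` and `result` are illustrative.

```r
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text.
alpha  <- regmatches(x, regexpr("[[:alpha:]]+", x))
number <- regmatches(x, regexpr("[[:digit:]]+", x))

result <- data.frame(alpha = alpha,
                     number = as.integer(number),  # store the digits as integers
                     stringsAsFactors = FALSE)
print(result)
```

One caveat: regmatches() silently drops strings that contain no match, so if some call numbers lacked letters or digits, the two vectors could end up with different lengths. stri_extract_first() returns NA in that case, which keeps the rows aligned.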
