Original answer: See below for an update.
First, I made your example data a little more complex by appending a copy of the first row to the bottom.
    dff <- structure(list(
      code = c("61000-61003", "0169T-0169T", "61000-61003"),
      label = c("excision of CNS", "ventricular shunt", "excision of CNS")
    ), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")
    dff
    #          code             label
    # 1 61000-61003   excision of CNS
    # 2 0169T-0169T ventricular shunt
    # 3 61000-61003   excision of CNS
We can use the sequence operator : to build the sequences for the code column, wrapping it in tryCatch() so that we avoid the error and keep the values that cannot be sequenced. First, split the values on the dash -, then pass the result through lapply().
    xx <- lapply(
      strsplit(dff$code, "-", fixed = TRUE),
      function(x) tryCatch(x[1]:x[2], warning = function(w) x)
    )
    data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx)))
    #     code             label
    # 1  61000   excision of CNS
    # 2  61001   excision of CNS
    # 3  61002   excision of CNS
    # 4  61003   excision of CNS
    # 5  0169T ventricular shunt
    # 6  0169T ventricular shunt
    # 7  61000   excision of CNS
    # 8  61001   excision of CNS
    # 9  61002   excision of CNS
    # 10 61003   excision of CNS
We try to apply the sequence operator : to each element returned by strsplit(). If x[1]:x[2] cannot be evaluated, tryCatch() returns the original values for that element; otherwise it returns the sequence x[1]:x[2]. Then we simply replicate the values of the label column according to the lengths of the elements of xx to get the new label column.
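As a minimal sketch of why this works, the two cases can be tried in isolation. The helper name expand() and the small input vectors are only for illustration and are not part of the answer's code. The : operator coerces character arguments to numbers, so a purely numeric pair yields a sequence, while a pair such as "0169T" raises a coercion warning that the handler turns into "return the input unchanged".

```r
# Hypothetical helper wrapping the same tryCatch() call used above.
expand <- function(x) tryCatch(x[1]:x[2], warning = function(w) x)

num_pair <- c("61000", "61003")   # coercible to numbers
txt_pair <- c("0169T", "0169T")   # not coercible, triggers a warning

seq_ok       <- expand(num_pair)  # the sequence 61000..61003
seq_fallback <- expand(txt_pair)  # the original pair, unchanged

# lengths() gives one count per list element, and rep() repeats each
# label that many times so the labels line up with the expanded codes.
parts  <- list(seq_ok, seq_fallback)
labels <- rep(c("excision of CNS", "ventricular shunt"), lengths(parts))
labels
```

Note that the fallback keeps both endpoints of the pair, which is why "0169T" appears twice in the output above.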
Update: Here is what I came up with in response to your edit. Replace xx above with
    xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
      # position of the first capital letter in each endpoint (NA if none)
      s <- stringi::stri_locate_first_regex(x, "[A-Z]")
      nc <- nchar(x)[1L]
      # build a zero-padded integer format of width n, e.g. "%05d"
      fmt <- function(n) paste0("%0", n, "d")
      if(!all(is.na(s))) {
        ss <- s[1,1]
        fmt <- fmt(nc-1)
        if(ss == 1L) {
          # letter prefix, e.g. "T0169"
          xx <- substr(x, 2, nc)
          paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
        } else {
          # letter suffix, e.g. "0169T"
          xx <- substr(x, 1, ss-1)
          paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
        }
      } else {
        # purely numeric codes
        sprintf(fmt(nc), x[1]:x[2])
      }
    })
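To see what the padding step does on its own, here is a small sketch for a single suffixed range. It assumes the letter is the last character and uses as.integer() explicitly, whereas the code above relies on : coercing the strings; the variable names are only for illustration.

```r
x  <- c("0169T", "0174T")
nc <- nchar(x)[1L]                      # 5 characters per code

digits <- substr(x, 1, nc - 1)          # "0169" "0174"
suffix <- substr(x[1], nc, nc)          # "T"

# Converting to integer drops the leading zero: 169, 170, ..., 174.
nums <- as.integer(digits[1]):as.integer(digits[2])

# "%04d" pads each number back to four digits before the suffix is re-attached.
fmt   <- paste0("%0", nc - 1, "d")
codes <- paste0(sprintf(fmt, nums), suffix)
codes
# [1] "0169T" "0170T" "0171T" "0172T" "0173T" "0174T"
```

Without the zero-padded format, the codes would come back as "169T", "170T" and so on, which no longer match the original code width.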
Yeah, this one is complicated. Now, if we take the following data df2 as a test case
    df2 <- structure(list(
      code = c("61000-61003", "0169T-0174T", "61000-61003", "T0169-T0174"),
      label = c("excision of CNS", "ventricular shunt", "excision of CNS",
                "ventricular shunt")
    ), .Names = c("code", "label"), row.names = c(NA, 4L), class = "data.frame")
and run the xx code on it (using df2$code in place of dff$code), we get the following result.
    data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))