Dividing a single column into multiple cases using R

I am working on HCUP data and this has a range of values ​​in one column that needs to be split into multiple columns. Below is the HCUP data frame for reference:

code label 61000-61003 excision of CNS 0169T-0169T ventricular shunt 

The required conclusion should be:

 code label 61000 excision of CNS 61001 excision of CNS 61002 excision of CNS 61003 excision of CNS 0169T ventricular shunt 

My approach to this problem is to use splitstackshape and use this code

 library(data.table) library(splitstackshape) cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)] 

This approach leads to memory problems. Is there a better approach to this problem?

Some comments:

  • Data has many letters besides "T".
  • A letter can be either in front or at the very end, but not between two numbers.
  • From the letter "T" to "U" there are no changes in the same range
+7
r medical data.table data-cleaning splitstackshape
source share
5 answers

Here's a solution using dplyr and all.is.numeric from Hmisc :

 library(dplyr) library(Hmisc) library(tidyr) dat %>% separate(code, into=c("code1", "code2")) %>% rowwise %>% mutate(lists = ifelse(all.is.numeric(c(code1, code2)), list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))), list(code1))) %>% unnest(lists) %>% select(code = lists, label) Source: local data frame [5 x 2] code label (chr) (fctr) 1 61000 excision of CNS 2 61001 excision of CNS 3 61002 excision of CNS 4 61003 excision of CNS 5 0169T ventricular shunt 

Editing to change ranges with character values. Simplifies slightly:

 dff %>% mutate(row = row_number()) %>% separate(code, into=c("code1", "code2")) %>% group_by(row) %>% summarise(lists = if(all.is.numeric(c(code1, code2))) {list(str_pad(as.character( seq(from = as.numeric(code1), to = as.numeric(code2))), nchar(code1), pad="0"))} else if(grepl("^[0-9]", code1)) {list(str_pad(paste0(as.character( seq(from = extract_numeric(code1), to = extract_numeric(code2))), strsplit(code1, "[0-9]+")[[1]][2]), nchar(code1), pad = "0"))} else {list(paste0( strsplit(code1, "[0-9]+")[[1]], str_pad(as.character( seq(from = extract_numeric(code1), to = extract_numeric(code2))), nchar(gsub("[^0-9]", "", code1)), pad="0")))}, label = first(label)) %>% unnest(lists) %>% select(-row) Source: local data frame [15 x 2] label lists (chr) (chr) 1 excision of CNS 61000 2 excision of CNS 61001 3 excision of CNS 61002 4 ventricular shunt 0169T 5 ventricular shunt 0170T 6 ventricular shunt 0171T 7 excision of CNS 01000 8 excision of CNS 01001 9 excision of CNS 01002 10 some procedure A2543 11 some procedure A2544 12 some procedure A2545 13 some procedure A0543 14 some procedure A0544 15 some procedure A0545 

Data:

 dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", "A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", "excision of CNS", "some procedure", "some procedure")), .Names = c("code", "label"), row.names = c(NA, 5L), class = "data.frame") 
+7
source share

Original answer: See below for an update.

First, I made your example data more complex by adding the first line to the bottom.

 dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003" ), label = c("excision of CNS", "ventricular shunt", "excision of CNS" )), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame") dff # code label # 1 61000-61003 excision of CNS # 2 0169T-0169T ventricular shunt # 3 61000-61003 excision of CNS 

We can use the sequence operator : to get the sequences for the code column by wrapping with tryCatch() so that we can avoid the error and save values ​​that cannot be ordered. First, divide the values ​​into the label - , then pass it through lapply() .

 xx <- lapply( strsplit(dff$code, "-", fixed = TRUE), function(x) tryCatch(x[1]:x[2], warning = function(w) x) ) data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx))) # code label # 1 61000 excision of CNS # 2 61001 excision of CNS # 3 61002 excision of CNS # 4 61003 excision of CNS # 5 0169T ventricular shunt # 6 0169T ventricular shunt # 7 61000 excision of CNS # 8 61001 excision of CNS # 9 61002 excision of CNS # 10 61003 excision of CNS 

We are trying to apply the sequence operator : to each element from strsplit() , and if the adoption of x[1]:x[2] impossible, then it returns only the values ​​for these elements and continues with the sequence x[1]:x[2] otherwise . Then we simply replicate the values ​​of the label column based on the resulting lengths in xx to get a new label column.


Update: This is what I got in response to your editing. Replace xx above

 xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) { s <- stringi::stri_locate_first_regex(x, "[AZ]") nc <- nchar(x)[1L] fmt <- function(n) paste0("%0", n, "d") if(!all(is.na(s))) { ss <- s[1,1] fmt <- fmt(nc-1) if(ss == 1L) { xx <- substr(x, 2, nc) paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2])) } else { xx <- substr(x, 1, ss-1) paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc)) } } else { sprintf(fmt(nc), x[1]:x[2]) } }) 

Yes. This is hard. Now, if we take the next df2 data df2 as a test case

 df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003", "T0169-T0174"), label = c("excision of CNS", "ventricular shunt", "excision of CNS", "ventricular shunt")), .Names = c("code", "label"), row.names = c(NA, 4L), class = "data.frame") 

and run the xx code on top of it, we get the following result.

 data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx))) # code label # 1 61000 excision of CNS # 2 61001 excision of CNS # 3 61002 excision of CNS # 4 61003 excision of CNS # 5 0169T ventricular shunt # 6 0170T ventricular shunt # 7 0171T ventricular shunt # 8 0172T ventricular shunt # 9 0173T ventricular shunt # 10 0174T ventricular shunt # 11 61000 excision of CNS # 12 61001 excision of CNS # 13 61002 excision of CNS # 14 61003 excision of CNS # 15 T0169 ventricular shunt # 16 T0170 ventricular shunt # 17 T0171 ventricular shunt # 18 T0172 ventricular shunt # 19 T0173 ventricular shunt # 20 T0174 ventricular shunt 
+6
source share

Create a sequence rule for these codes:

 seq_code <- function(from,to){ ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x) pre = unique(sapply(list(from,to), ext, part = 1 )) suf = unique(sapply(list(from,to), ext, part = 3 )) if (length(pre) > 1 | length(suf) > 1){ return("NO!") } num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2)))) len = nchar(from)-nchar(pre)-nchar(suf) paste0(pre, sprintf(paste0("%0",len,"d"), num), suf) } 

With an @jeremycg example:

 setDT(dff)[,.( label = label[1], code = do.call(seq_code, tstrsplit(code,'-')) ), by=.(row=seq(nrow(dff)))] 

which gives

  row label code 1: 1 excision of CNS 61000 2: 1 excision of CNS 61001 3: 1 excision of CNS 61002 4: 2 ventricular shunt 0169T 5: 2 ventricular shunt 0170T 6: 2 ventricular shunt 0171T 7: 3 excision of CNS 01000 8: 3 excision of CNS 01001 9: 3 excision of CNS 01002 10: 4 some procedure A2543 11: 4 some procedure A2544 12: 4 some procedure A2545 13: 5 some procedure A0543 14: 5 some procedure A0544 15: 5 some procedure A0545 

Data copied from @jeremycg answer:

 dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", "A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", "excision of CNS", "some procedure", "some procedure")), .Names = c("code", "label"), row.names = c(NA, 5L), class = "data.frame") 
+3
source share

If you are patient enough, you are likely to parse the lines into separate parts instead of the eval / parse trick, alas, I am not so:

 fancy.seq = function(x) eval(parse(text=sub(', \\)', ')', sub('\\(, ', '(', sub('.*?([0-9]+)(.*)-(.*?)([1-9][0-9]*).*', 'paste0("\\3", formatC(\\1:\\4, width=log10(\\4)+1, format="d", flag="0"), "\\2")', x))))) # using example from jeremycg answer dt[, .(fancy.seq(code), label), by = 1:nrow(dt)] # nrow V1 label # 1: 1 61000 excision of CNS # 2: 1 61001 excision of CNS # 3: 1 61002 excision of CNS # 4: 2 0169T ventricular shunt # 5: 2 0170T ventricular shunt # 6: 2 0171T ventricular shunt # 7: 3 01000 excision of CNS # 8: 3 01001 excision of CNS # 9: 3 01002 excision of CNS #10: 4 A2543 some procedure #11: 4 A2544 some procedure #12: 4 A2545 some procedure #13: 5 A0543 some procedure #14: 5 A0544 some procedure #15: 5 A0545 some procedure 

If it is not clear what has been done above, just run the sub commands one by one in one of the lines of code.

+3
source share

A less elegant way to do this:

 # the data hcup <- data.frame(code=c("61000-61003", "0169T-0169T"), label=c("excision of CNS", "ventricular shunt"), stringsAsFactors = F) hcup > code label >1 61000-61003 excision of CNS >2 0169T-0169T ventricular shunt # reshaping # split the code ranges into separate columns seq.ends <- cbind(do.call(rbind.data.frame, strsplit(hcup$code, "-")), hcup$label) # create a list with a data.frame for each original line new.list <- apply(seq.ends, 1, FUN=function(x){data.frame(code=if(grepl("\\d{5}", x[1])){ z<-x[1]:x[2]}else{z<-x[1]}, label=rep(x[3], length(z)), stringsAsFactors = F)}) # collapse the list into a df new.df <- do.call(rbind, lapply(new.list, data.frame, stringsAsFactors=F)) new.df > code label >1.1 61000 excision of CNS >1.2 61001 excision of CNS >1.3 61002 excision of CNS >1.4 61003 excision of CNS >2 0169T ventricular shunt 
+1
source share

All Articles