Consolidate strings based on date ranges

I would like to combine the rows of the data frame so that the ranges described by the "start" and "end" columns include all the values ​​from the original dataset. There may be overlays, repeats, and nested ranges. Some ranges may be missing.

Here is an example of the data I would like to collapse:

data = data.frame(rbind( c("Roger", 1, 10), c("Roger", 10, 15), c("Roger", 16, 17), c("Roger", 3, 6), c("Roger", 20, 25), c("Roger", NA, NA), c("Susan", 2, 8))) names(data) = c("name", "start", "end") data$start = as.numeric(as.character(data$start)) data$end = as.numeric(as.character(data$end)) 

Desired Result:

 name start end Roger 1 17 Roger 20 25 Susan 2 8 

My attempt is to expand each element in a range for each row. This works, but then I'm not sure how to compress it back. In addition, the complete dataset I'm working with has ~ 30 million rows and very large ranges, so this method is VERY slow.

 pb <- txtProgressBar(min = 0, max = length(data$name), style = 3) mylist = list() for(i in 1:length(data$name)){ subdata = data[i,] if(is.na(subdata$start)){ mylist[[i]] = subdata mylist[[i]]$each = NA } if(!is.na(subdata$start)){ sequence = seq(subdata$start, subdata$end) mylist[[i]] = subdata[rep(1, each = length(sequence)),] mylist[[i]]$daily = sequence } setTxtProgressBar(pb, i) } rbindlist(mylist) 
+6
source share
1 answer

I assume IRanges is much more efficient for this, but ...

 library(data.table) # remove missing values DT = na.omit(setDT(data)) # sort setorder(DT, name, start) # mark threshold for a new group DT[, high_so_far := shift(cummax(end), fill=end[1L]), by=name] # group and summarise DT[, .(start[1L], end[.N]), by=.( name, g = cumsum(start > high_so_far + 1L) )] # name g V1 V2 # 1: Roger 0 1 17 # 2: Roger 1 20 25 # 3: Susan 1 2 8 

How it works:

  • cummax is the cumulative maximum, so the highest value so far, including the current line.
  • To accept a value excluding the current line, use shift (which is extracted from the previous line).
  • cumsum(some_condition) is the standard way to create a grouping variable.
  • .N is the last line of the group defined by= .

Columns can be named at the last stage, for example .(s = start[1L], e = end[.N]) .


With time intervals . If you work with dates, I suggest the IDate class; just use as.IDate to convert a Date .

We can +1 by date, but unfortunately we can’t cummax , so ...

 cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate"))) set.seed(1) d = sample(as.IDate("2011-11-11") + 1:10) cummax_idate(d) # [1] "2011-11-14" "2011-11-15" "2011-11-16" "2011-11-18" "2011-11-18" # [6] "2011-11-19" "2011-11-20" "2011-11-20" "2011-11-21" "2011-11-21" 

I think this function can be used instead of cummax .

There is an extra () in the function because setattr will not output its output.

+10
source

All Articles