Consolidate rows based on date ranges

Question

I would like to combine the rows of a data frame so that the ranges described by the "start" and "end" columns cover all the values from the original dataset. There may be overlapping, duplicated, and nested ranges. Some ranges may be missing.

Here is an example of the data I would like to collapse:

    data = data.frame(rbind(
      c("Roger", 1, 10),
      c("Roger", 10, 15),
      c("Roger", 16, 17),
      c("Roger", 3, 6),
      c("Roger", 20, 25),
      c("Roger", NA, NA),
      c("Susan", 2, 8)))
    names(data) = c("name", "start", "end")
    data$start = as.numeric(as.character(data$start))
    data$end = as.numeric(as.character(data$end))

Desired Result:

     name  start end
     Roger     1  17
     Roger    20  25
     Susan     2   8

My attempt expands each row into one row per value in its range. This works, but I am not sure how to collapse the result back into ranges. In addition, the complete dataset I am working with has ~30 million rows and very large ranges, so this method is VERY slow.

    library(data.table)  # for rbindlist()

    pb <- txtProgressBar(min = 0, max = nrow(data), style = 3)
    mylist = list()
    for (i in seq_len(nrow(data))) {
      subdata = data[i, ]
      if (is.na(subdata$start)) {
        # keep rows without a range as a single row
        mylist[[i]] = subdata
        mylist[[i]]$daily = NA
      } else {
        # one row per value in the range
        sequence = seq(subdata$start, subdata$end)
        mylist[[i]] = subdata[rep(1, length(sequence)), ]
        mylist[[i]]$daily = sequence
      }
      setTxtProgressBar(pb, i)
    }
    rbindlist(mylist)

date r dataframe data.table

Nancy Aug 19 '16 at 19:11

1 answer

Frank · Accepted Answer · 2016-08-19T20:07:08+0000

I assume IRanges is much more efficient for this, but ...

    library(data.table)

    # remove missing values
    DT = na.omit(setDT(data))

    # sort
    setorder(DT, name, start)

    # mark threshold for a new group
    DT[, high_so_far := shift(cummax(end), fill=end[1L]), by=name]

    # group and summarise
    DT[, .(start[1L], end[.N]), by=.(
      name,
      g = cumsum(start > high_so_far + 1L)
    )]

    #     name g V1 V2
    # 1: Roger 0  1 17
    # 2: Roger 1 20 25
    # 3: Susan 1  2  8

How it works:

cummax is the cumulative maximum, that is, the highest value so far, including the current row.
To get the highest value excluding the current row, use shift (which takes the value from the previous row).
cumsum(some_condition) is the standard way to create a grouping variable.
.N is the index of the last row of the group defined in by=.
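
To make the steps above concrete, here is a minimal sketch that spells out each intermediate column for Roger's rows from the question. The column names running_max and new_group are my own labels for illustration; the answer computes the same values inline.

```r
library(data.table)

# Roger's rows from the question, already sorted by start
DT = data.table(
  name  = "Roger",
  start = c(1, 3, 10, 16, 20),
  end   = c(10, 6, 15, 17, 25)
)

# running maximum of end, including the current row
DT[, running_max := cummax(end)]
# the same value taken from the previous row; the first row falls back to its own end
DT[, high_so_far := shift(running_max, fill = end[1L])]
# TRUE where a row starts past the range covered so far (the + 1 lets adjacent ranges merge)
DT[, new_group := start > high_so_far + 1L]
# the running count of TRUEs is the group id
DT[, g := cumsum(new_group)]

print(DT)
```

Only the last row (20 to 25) starts past high_so_far + 1, so it is the only row that opens a new group; the row starting at 16 is merged because it is adjacent to the range ending at 15.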

Columns can be named in the last step, for example .(s = start[1L], e = end[.N]).
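
As a sketch of the naming step, the snippet below names the output columns and drops the helper column g. It also swaps end[.N] for max(end); that is my own variation, not part of the answer. It matters when the last row of a group is nested inside an earlier one, as in the hypothetical two-row input here, where end[.N] would pick 6 rather than 10.

```r
library(data.table)

# hypothetical input: the second range (3-6) is nested inside the first (1-10)
DT = data.table(name = "Roger", start = c(1, 3), end = c(10, 6))
setorder(DT, name, start)

# highest end seen before the current row, per name
DT[, high_so_far := shift(cummax(end), fill = end[1L]), by = name]

# name the output columns and take the largest end in each group
res = DT[, .(start = start[1L], end = max(end)),
         by = .(name, g = cumsum(start > high_so_far + 1L))]

# g is only a grouping helper, so drop it from the result
res[, g := NULL]
print(res)
```

On the question's sample data both versions give the same result, because the last row of every group there also has the largest end.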

With date ranges. If you are working with dates, I suggest data.table's IDate class; use as.IDate to convert a Date.

We can add 1 to a date, but unfortunately cummax does not work on dates, so ...

    # cumulative maximum on the underlying integers, then restore the class
    # (IDate objects carry the class vector c("IDate", "Date"))
    cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("IDate", "Date")))

    set.seed(1)
    d = sample(as.IDate("2011-11-11") + 1:10)
    cummax_idate(d)

    #  [1] "2011-11-14" "2011-11-15" "2011-11-16" "2011-11-18" "2011-11-18"
    #  [6] "2011-11-19" "2011-11-20" "2011-11-20" "2011-11-21" "2011-11-21"

I think this function can be used in place of cummax.

The extra pair of parentheses in the function is there because setattr returns its result invisibly, so without them the result would not print.
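
Here is a rough sketch of how the date version might be wired into the same grouping steps. The three date ranges are made up for illustration, and cummax_idate is repeated so the snippet stands on its own, with the class vector written in the order data.table itself uses, c("IDate", "Date").

```r
library(data.table)

# cumulative maximum for IDate vectors, restoring the class afterwards
cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("IDate", "Date")))

# hypothetical date ranges, not taken from the question
DT = data.table(
  name  = c("Roger", "Roger", "Roger"),
  start = as.IDate(c("2016-01-01", "2016-01-05", "2016-01-20")),
  end   = as.IDate(c("2016-01-10", "2016-01-11", "2016-01-25"))
)
setorder(DT, name, start)

# same threshold column as before, with cummax_idate standing in for cummax
DT[, high_so_far := shift(cummax_idate(end), fill = end[1L]), by = name]

# a row opens a new group when it starts more than one day after the covered range
res = DT[, .(start = start[1L], end = end[.N]),
         by = .(name, g = cumsum(start > high_so_far + 1L))]
print(res)
```

The first two ranges overlap and collapse into one row, while the range starting on 2016-01-20 lies more than a day past 2016-01-11 and stays separate.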
