Collapse of intersecting regions in R

I am trying to find a way to collapse rows with intersecting ranges indicated by the "start" and "stop" columns, and write the collapsed values ​​to the new columns. For example, I have this data frame:

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), name=c("a","b","c","d","e","f","g"), start=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), stop=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952))) chrom name start stop 1 a 0 71200 1 b 70001 71200 1 c 70203 80001 1 d 70060 71051 14 e 40004 42004 16 f 50000872 50000890 16 g 50000872 51000952 

And I'm trying to find overlapping ranges and write the largest range covered by collapsed overlapping lines in "start" and "stop" and the names of the collapsed lines, so I would get the following:

 chrom start stop name 1 70001 80001 a,b,c,d 14 40004 42004 e 16 50000872 51000952 f,g 

I think I could use IRanges packages as follows:

 library(IRanges) ranges <- split(IRanges(my.df$start, my.df$stop), my.df$chrom) 

But then I have problems getting collapsed columns: I tried with findOvarlaps, but this

 ov <- findOverlaps(ranges, ranges, type="any") 

but I don’t think it is right.

Any help would be greatly appreciated.

Thanks! -fra

+8
r overlap
source share
2 answers

IRanges is a good candidate for such a job. There is no need to use the chrom variable.

 ir <- IRanges(my.df$start, my.df$stop) ## I create a new grouping variable Note the use of reduce here(performance issue) my.df$group2 <- subjectHits(findOverlaps(ir, reduce(ir))) # chrom name start stop group2 # 1 1 a 70001 71200 2 # 2 1 b 70203 80001 2 # 3 1 c 70060 71051 2 # 4 14 d 40004 42004 1 # 5 16 e 50000872 50000890 3 # 6 16 f 50000872 51000952 3 

The new variable group2 is a range indicator. Now, using data.table , I cannot convert my data to the desired result:

 library(data.table) DT <- as.data.table(my.df) DT[, list(start=min(start),stop=max(stop), name=list(name),chrom=unique(chrom)), by=group2] # group2 start stop name chrom # 1: 2 70001 80001 a,b,c 1 # 2: 1 40004 42004 d 14 # 3: 3 50000872 51000952 e,f 16 

PS: the name of the sleeping variable is not a string here, but a list of the factor. This is more efficient and easier to access than a collapsed symbol, for example, using a paste.

EDIT afcter OP, I will create a varibale group by chrome. I mean, now the Iranges code is called for each chrome group. I am slightly modifying your data to create a group of intervals of the same chromosome.

 my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), name=c("a","b","c","d","e","f","g"), start=as.numeric(c(0,3000,70203,70060, 40004, 50000872, 50000872)), stop=as.numeric(c(1,5000,80001,71051, 42004, 50000890, 51000952))) library(data.table) DT <- as.data.table(my.df) ## find interval for each chromsom DT[,group := { ir <- IRanges(start, stop); subjectHits(findOverlaps(ir, reduce(ir))) },by=chrom] ## Now I group by group and chrom DT[, list(start=min(start),stop=max(stop),name=list(name),chrom=unique(chrom)), by=list(group,chrom)] group chrom start stop name chrom 1: 1 1 0 1 a 1 2: 2 1 3000 5000 b 1 3: 3 1 70060 80001 c,d 1 4: 1 14 40004 42004 e 14 5: 1 16 50000872 51000952 f,g 16 
+9
source share

After sorting the data, you can easily test if the interval overlaps the previous one and assign a label to each set of overlapping intervals. When you have these labels, you can use ddply to aggregate data.

 d <- data.frame( chrom = c(1,1,1,14,16,16), name = c("a","b","c","d","e","f"), start = as.numeric(c(70001,70203,70060, 40004, 50000872, 50000872)), stop = as.numeric(c(71200,80001,71051, 42004, 50000890, 51000952)) ) # Make sure the data is sorted d <- d[ order(d$start), ] # Check if a record should be linked with the previous d$previous_stop <- c(NA, d$stop[-nrow(d)]) d$previous_stop <- cummax(ifelse(is.na(d$previous_stop),0,d$previous_stop)) d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop # The number of the current group of records is the number of times we have switched to a new group d$group <- cumsum( d$new_group ) # We can now aggregate the data library(plyr) ddply( d, "group", summarize, start=min(start), stop=max(stop), name=paste(name,collapse=",") ) # group start stop name # 1 1 0 80001 a,d,c,b # 2 2 50000872 51000952 e,f 

But this ignores the chrom column: to account for it, you can do the same for each chromosome separately.

 d <- d[ order(d$chrom, d$start), ] d <- ddply( d, "chrom", function(u) { x <- c(NA, u$stop[-nrow(u)]) y <- ifelse( is.na(x), 0, x ) y <- cummax(y) y[ is.na(x) ] <- NA u$previous_stop <- y u } ) d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop d$group <- cumsum( d$new_group ) ddply( d, .(chrom,group), summarize, start=min(start), stop=max(stop), name=paste(name,collapse=",") ) # chrom group start stop name # 1 1 1 0 80001 a,c,b # 2 14 2 40004 42004 d # 3 16 3 50000872 51000952 e,f 
+5
source share

All Articles