Alternatives to loops in R?

I have 2 files that I would like to merge using R.

    head(bed)
    chr8 41513235 41513282 ANK1.Exon1
    chr8 41518973 41519092 ANK1.Exon2

The first indicates the intervals and their names. (Chromosome, from, to, name)

    head(coverage)
    chr1 41513235 20
    chr1 41513236 19
    chr1 41513237 19

The second provides the coverage for single bases. (Chromosome, position, coverage)

Now I want the name of the matching exon written next to each position. Some positions will not fall inside any exon; I want to delete those rows afterwards.

I figured out a way to do what I want. However, it requires 3 nested loops and about 15 hours of computing time. Since looping is not the best practice in R, I would like to know if anyone knows a better way than:

    coverage <- cbind(coverage, "Exon")
    coverage[,4] <- NA
    for(i in 1:nrow(bed)){
      for(n in bed[i,2]:bed[i,3]){
        for(m in 1:nrow(coverage)){
          if(coverage[m,2]==n){
            # as.character() avoids storing factor codes instead of names
            coverage[m,4] <- as.character(bed[i,4])
          }
        }
      }
    }
    coverage <- na.omit(coverage)

Since all three positions are in the interval "ANK1.Exon1", the output should look like this:

    head(coverage)
    chr1 41513235 20 ANK1.Exon1
    chr1 41513236 19 ANK1.Exon1
    chr1 41513237 19 ANK1.Exon1
3 answers

The fastest way to accomplish what I was looking for:

    library("sqldf")
    res <- sqldf("select * from coverage f1
                  inner join bed f2
                  on(f1.position >=f2.'from' and f1.position <=f2.'to')")

The calculation time was reduced to seconds. To get exactly the result shown above, I then reduced the data frame to the relevant columns:

 res <- cbind(res[1:4],res[8]) 

Thank you all for your help.

Edit: in large data sets, the same positions may appear on more than one chromosome, so it is useful to also match on the chromosome:

    res <- sqldf("select * from coverage f1
                  inner join bed f2
                  on(f1.position >=f2.'from' and f1.position <=f2.'to'
                     and f1.Chromosome = f2.Chromosome)")
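
For readers who want to see what this join condition does without installing sqldf, here is a minimal base R sketch of the same logic. The column names `Chromosome`, `position`, `from` and `to` are the ones used in the query above; the `name` and `coverage` column names and the toy values are assumptions made up for illustration. It materialises every same-chromosome pair before filtering, so it is only meant to illustrate the condition on small inputs, not to replace the query.

```r
# Toy intervals; "name" is a hypothetical column name for the exon label.
bed <- data.frame(
  Chromosome = c("chr1", "chr1"),
  from = c(41513235, 41518973),
  to   = c(41513282, 41519092),
  name = c("ANK1.Exon1", "ANK1.Exon2"),
  stringsAsFactors = FALSE
)

# Toy per-base coverage; "coverage" is a hypothetical column name.
coverage <- data.frame(
  Chromosome = c("chr1", "chr1", "chr1", "chr2"),
  position   = c(41513235, 41513236, 41513300, 41513235),
  coverage   = c(20, 19, 7, 5),
  stringsAsFactors = FALSE
)

# Pair every position with every interval on the same chromosome ...
pairs <- merge(coverage, bed, by = "Chromosome")

# ... then keep only the pairs where the position lies inside the interval.
keep <- pairs$position >= pairs$from & pairs$position <= pairs$to
res  <- pairs[keep, c("Chromosome", "position", "coverage", "name")]
print(res)
```

The position 41513300 lies between the two exons and the chr2 position has no interval on its chromosome, so both are dropped, just as the inner join would drop them.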

This algorithm is linear if both inputs, bed and coverage, are sorted and the intervals in bed do not overlap each other.

    coverage <- read.table("coverage")
    bed <- read.table("bed")

    coverage <- cbind(coverage, "Exon")
    coverage[,4] <- NA

    i_coverage <- 1
    i_bed <- 1

    while(i_coverage <= length(coverage[,1]) && i_bed <= length(bed[,1])) {
      if(coverage[i_coverage, 2] < bed[i_bed, 2]){
        i_coverage <- i_coverage + 1
      }else{
        #then coverage[i_coverage, 2] >= bed[i_bed, 2]
        if(coverage[i_coverage, 2] <= bed[i_bed, 3]){
          coverage[i_coverage,4] <- as.character(bed[i_bed, 4])
          i_coverage <- i_coverage + 1
        }else{
          i_bed <- i_bed + 1
        }
      }
    }

This gives:

    print(coverage)
        V1       V2 V3     "Exon"
    1 chr1 41513235 20 ANK1.Exon1
    2 chr1 41513236 19 ANK1.Exon1
    3 chr1 41513237 19 ANK1.Exon1
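
Under the same assumptions (intervals in bed sorted by start and not overlapping), the lookup can also be sketched without an explicit loop using base R's `findInterval()`. This is a swapped-in vectorised technique rather than the loop above; the toy data and the `Exon` column name are made up for illustration, and like the loop it ignores the chromosome column, so it assumes a single chromosome.

```r
# Toy data using read.table-style default column names (V1..V4).
bed <- data.frame(
  V1 = c("chr1", "chr1"),
  V2 = c(41513235, 41518973),   # interval start, sorted ascending
  V3 = c(41513282, 41519092),   # interval end
  V4 = c("ANK1.Exon1", "ANK1.Exon2"),
  stringsAsFactors = FALSE
)
coverage <- data.frame(
  V1 = c("chr1", "chr1", "chr1", "chr1"),
  V2 = c(41513235, 41513236, 41513237, 41513300),
  V3 = c(20, 19, 19, 7),
  stringsAsFactors = FALSE
)

# For each position: index of the last interval whose start is <= position
# (0 if the position lies before the first interval).
idx <- findInterval(coverage$V2, bed$V2)

# A position is inside that interval only if it is also <= the interval end.
inside <- idx > 0
inside[inside] <- coverage$V2[inside] <= bed$V3[idx[inside]]

# Label the matching rows and drop positions outside every interval.
coverage$Exon <- NA_character_
coverage$Exon[inside] <- bed$V4[idx[inside]]
coverage <- na.omit(coverage)
print(coverage)
```

With several chromosomes, one option would be to apply the same steps to each chromosome separately, for example after `split()`.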

Using GenomicRanges:

    library("GenomicRanges")

    #data
    x1 <- read.table(text="chr1 41513235 41513282 ANK1.Exon1
    chr1 41518973 41519092 ANK1.Exon2")
    x2 <- read.table(text="chr1 41513235 20
    chr1 41513236 19
    chr1 41513237 19")

    #Convert to Granges object:
    g1 <- GRanges(seqnames=x1$V1, IRanges(start=x1$V2, end=x1$V3), Exon=x1$V4)
    g2 <- GRanges(seqnames=x2$V1, IRanges(start=x2$V2, end=x2$V2), covN=x2$V3)

    #merge
    mergeByOverlaps(g1,g2)

    #output
    # DataFrame with 3 rows and 4 columns
    #                            g1       Exon                          g2      covN
    #                     <GRanges>   <factor>                   <GRanges> <integer>
    # 1 chr1:*:[41513235, 41513282] ANK1.Exon1 chr1:*:[41513235, 41513235]        20
    # 2 chr1:*:[41513235, 41513282] ANK1.Exon1 chr1:*:[41513236, 41513236]        19
    # 3 chr1:*:[41513235, 41513282] ANK1.Exon1 chr1:*:[41513237, 41513237]        19