How to merge two pandas DataFrames (or carry values across) by comparing ranges of values

I have the following data:

    data01 =

    contig  start   end     haplotype_block
    2       5207    5867    1856
    2       155667  155670  2816
    2       67910   68022   2
    2       68464   68483   3
    2       525     775     132
    2       118938  119559  1157

    data02 =

    contig  start  last  feature  gene_id            gene_name  transcript_id
    2       5262   5496  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       5579   5750  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       5856   6032  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       6115   6198  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       916    1201  exon     scaffold_200001.1  NA         scaffold_200001.1
    2       614    789   exon     scaffold_200001.1  NA         scaffold_200001.1
    2       171    435   exon     scaffold_200001.1  NA         scaffold_200001.1
    2       2677   2806  exon     scaffold_200002.1  NA         scaffold_200002.1
    2       2899   3125  exon     scaffold_200002.1  NA         scaffold_200002.1

Problem:

  • I want to compare the ranges (start-end in data01, start-last in data02) between these two data frames.
  • If the ranges overlap, I want to copy the values of gene_id and gene_name from data02 into new columns in data01.

I tried (using pandas):

    data01['gene_id'] = ""
    data01['gene_name'] = ""

    data01['gene_id'] = data01['gene_id'].\
        apply(lambda x: data02['gene_id']\
        if range(data01['start'], data01['end'])\
        <= range(data02['start'], data02['last']) else 'NA')

How can I improve this code? I am sticking to pandas for now, but if the problem is better solved with a dictionary, I am open to that. Please explain the process as well: I want to learn, not just get an answer.

Thanks,

Desired output:

    contig  start   end     haplotype_block  gene_id                                                gene_name
    2       5207    5867    1856             scaffold_200003.1,scaffold_200003.1,scaffold_200003.1  CP5,CP5,CP5
    # the gene_id and gene_name are repeated 3 times because three intervals (i.e. 5262-5496, 5579-5750, 5856-6032) from data02 overlap (or touch) the interval range from data01 (5207-5867)
    # So, whenever the ranges of the two data frames overlap, copy the gene_id and gene_name,
    # and simply put NA in gene_id and gene_name for non-overlapping ranges
    2       155667  155670  2816             NA                                                     NA
    2       67910   68022   2                NA                                                     NA
    2       68464   68483   3                NA                                                     NA
    2       525     775     132              scaffold_200001.1                                      NA
    2       118938  119559  1157             NA                                                     NA
python merge pandas dataframe bioinformatics
3 answers
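    import numpy as np
    import pandas as pd

    s1 = data01.start.values
    e1 = data01.end.values
    s2 = data02.start.values
    e2 = data02['last'].values

    # Two closed intervals overlap (or touch) when each one starts no later
    # than the other ends. This also catches a data02 range that fully
    # contains a data01 range. Broadcasting s1/e1 as a column against s2/e2
    # gives a len(data01) x len(data02) boolean matrix.
    overlap = (s1[:, None] <= e2) & (e1[:, None] >= s2)

    g = data02.gene_id.values
    n = data02.gene_name.values

    # i holds row positions in data01, j the matching row positions in data02
    i, j = np.where(overlap)
    idx_map = {i_: data01.index[i_] for i_ in pd.unique(i)}

    def make_series(m):
        # Join the matching values per data01 row, then map the row
        # positions back to data01's index labels so assign() can align.
        s = pd.Series(m[j]).fillna('').groupby(i).agg(','.join)
        return s.rename(index=idx_map).replace('', np.nan)

    data01.assign(
        gene_id=make_series(g),
        gene_name=make_series(n),
    )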
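The broadcasting answer compares every range in data01 with every range in data02 regardless of contig. As a rough sketch of an alternative that respects contig, the frames can be merged on contig first and the overlapping pairs filtered afterwards. The frames below are rebuilt by hand from the question, the literal string "NA" is assumed for missing gene names, and names such as `pairs`, `hits`, and `joined` are only illustrative.

```python
import pandas as pd

# Sample frames rebuilt from the question (gene_name "NA" kept as a plain string).
data01 = pd.DataFrame({
    "contig": [2, 2, 2, 2, 2, 2],
    "start": [5207, 155667, 67910, 68464, 525, 118938],
    "end": [5867, 155670, 68022, 68483, 775, 119559],
    "haplotype_block": [1856, 2816, 2, 3, 132, 1157],
})
data02 = pd.DataFrame({
    "contig": [2] * 9,
    "start": [5262, 5579, 5856, 6115, 916, 614, 171, 2677, 2899],
    "last": [5496, 5750, 6032, 6198, 1201, 789, 435, 2806, 3125],
    "gene_id": ["scaffold_200003.1"] * 4
               + ["scaffold_200001.1"] * 3
               + ["scaffold_200002.1"] * 2,
    "gene_name": ["CP5"] * 4 + ["NA"] * 5,
})

# Pair every data01 row with every data02 row on the same contig.
# reset_index() keeps the data01 row label in a column named "index".
pairs = data01.reset_index().merge(
    data02[["contig", "start", "last", "gene_id", "gene_name"]],
    on="contig",
    suffixes=("", "_exon"),
)

# Keep the pairs whose closed intervals overlap or touch.
hits = pairs[(pairs["start"] <= pairs["last"]) & (pairs["end"] >= pairs["start_exon"])]

# Collapse the hits into one comma-joined string per data01 row.
joined = hits.groupby("index")[["gene_id", "gene_name"]].agg(",".join)

# Rows of data01 without any hit get "NA" in both new columns.
result = data01.join(joined).fillna({"gene_id": "NA", "gene_name": "NA"})
print(result)
```

The intermediate `pairs` frame has one row per (data01 row, data02 row) combination on each contig, so this is only practical while that product fits in memory.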


I understand that you are using Python, but your problem can be solved easily with the classic bioinformatics tool bedtools intersect: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

Both of your input files follow the standard BED layout (contig, start, end as the first columns): http://bedtools.readthedocs.io/en/latest/content/general-usage.html

bedtools intersect gives you fine-grained control over what counts as an intersection or overlap between two regions. I believe it can also work directly on bgzipped input.
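If the data already lives in pandas, one way to hand it to bedtools is to dump each frame as tab-separated text without a header, with contig, start, and end as the first three columns. The sketch below only prepares that text; the helper `to_bed_text` and the file names in the comments are hypothetical, and the coordinates are passed through as given (BED itself is 0-based and half-open, so check which convention your data uses).

```python
import pandas as pd

# Two rows from each frame in the question, enough to show the layout.
data01 = pd.DataFrame({
    "contig": [2, 2],
    "start": [5207, 525],
    "end": [5867, 775],
    "haplotype_block": [1856, 132],
})
data02 = pd.DataFrame({
    "contig": [2, 2],
    "start": [5262, 614],
    "last": [5496, 789],
    "gene_id": ["scaffold_200003.1", "scaffold_200001.1"],
    "gene_name": ["CP5", "NA"],
})

def to_bed_text(df, start_col, end_col, extra_cols):
    """Hypothetical helper: return df as headerless, tab-separated BED-like text."""
    cols = ["contig", start_col, end_col] + list(extra_cols)
    return df[cols].to_csv(sep="\t", header=False, index=False)

bed01 = to_bed_text(data01, "start", "end", ["haplotype_block"])
bed02 = to_bed_text(data02, "start", "last", ["gene_id", "gene_name"])

# After saving bed01 and bed02 to files (say data01.bed and data02.bed),
# a left outer join that keeps every data01 range could be run as:
#   bedtools intersect -a data01.bed -b data02.bed -loj
print(bed01)
print(bed02)
```

The `-loj` output has one line per overlapping pair, so the gene_id and gene_name values would still need to be grouped per data01 range afterwards.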


You should use interval trees in Python; they are very efficient in both time and memory. I tried something like this and ran into a problem that was later solved, but here is the code I wrote: Using the interval tree to search overlapping areas

You can adapt that code.
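To make the idea concrete, here is a minimal standard-library sketch of an interval tree, written as a balanced binary search tree keyed on start in which every node also stores the largest end found in its subtree. It is not the linked code and not any particular package's API; the names `Node`, `build`, and `overlapping` are made up for illustration, and it assumes all ranges sit on one contig (in practice you would build one tree per contig).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    start: int
    end: int
    label: str
    max_end: int                      # largest end in this node's subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(intervals: List[Tuple[int, int, str]]) -> Optional[Node]:
    """Build a balanced tree from (start, end, label) tuples sorted by start."""
    if not intervals:
        return None
    mid = len(intervals) // 2
    start, end, label = intervals[mid]
    node = Node(start, end, label, end)
    node.left = build(intervals[:mid])
    node.right = build(intervals[mid + 1:])
    for child in (node.left, node.right):
        if child is not None:
            node.max_end = max(node.max_end, child.max_end)
    return node

def overlapping(node: Optional[Node], qs: int, qe: int) -> List[str]:
    """Labels of all stored closed intervals that overlap or touch [qs, qe]."""
    if node is None or node.max_end < qs:
        return []                     # nothing below ends late enough
    found = overlapping(node.left, qs, qe)
    if node.start <= qe:              # otherwise the right subtree starts too late
        if node.end >= qs:
            found.append(node.label)
        found.extend(overlapping(node.right, qs, qe))
    return found

# The data02 ranges from the question as (start, last, gene_id).
exons = [
    (5262, 5496, "scaffold_200003.1"), (5579, 5750, "scaffold_200003.1"),
    (5856, 6032, "scaffold_200003.1"), (6115, 6198, "scaffold_200003.1"),
    (916, 1201, "scaffold_200001.1"), (614, 789, "scaffold_200001.1"),
    (171, 435, "scaffold_200001.1"), (2677, 2806, "scaffold_200002.1"),
    (2899, 3125, "scaffold_200002.1"),
]
tree = build(sorted(exons))

# Query with two of the data01 ranges and join the hits, "NA" when empty.
for qs, qe in [(5207, 5867), (155667, 155670)]:
    print(qs, qe, ",".join(overlapping(tree, qs, qe)) or "NA")
```

The `max_end` check lets a query skip whole subtrees, which is what makes this cheaper than comparing every range against every other range once data02 grows large.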
