How to combine two pandas frames (or carry values) by comparing value ranges

Question


In the following data:

data01 =

    contig  start   end     haplotype_block
    2       5207    5867    1856
    2       155667  155670  2816
    2       67910   68022   2
    2       68464   68483   3
    2       525     775     132
    2       118938  119559  1157

data02 =

    contig  start  last  feature  gene_id            gene_name  transcript_id
    2       5262   5496  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       5579   5750  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       5856   6032  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       6115   6198  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       916    1201  exon     scaffold_200001.1  NA         scaffold_200001.1
    2       614    789   exon     scaffold_200001.1  NA         scaffold_200001.1
    2       171    435   exon     scaffold_200001.1  NA         scaffold_200001.1
    2       2677   2806  exon     scaffold_200002.1  NA         scaffold_200002.1
    2       2899   3125  exon     scaffold_200002.1  NA         scaffold_200002.1

Problem:

I want to compare the ranges (start-end in data01, start-last in data02) between these two data frames.
If the ranges overlap, I want to pass the values of gene_id and gene_name from data02 to a new column in data01.

I tried (using pandas):

    data01['gene_id'] = ""
    data01['gene_name'] = ""

    data01['gene_id'] = data01['gene_id'].\
        apply(lambda x: data02['gene_id']\
        if range(data01['start'], data01['end'])\
        <= range(data02['start'], data02['last']) else 'NA')

How can I improve this code? I am sticking to pandas for now, but if the problem is better solved with a dictionary, I am open to that. Please also explain the process: I want to learn how it works, not just get an answer.

Thanks,

Desired output:

    contig  start   end     haplotype_block  gene_id                                                gene_name
    2       5207    5867    1856             scaffold_200003.1,scaffold_200003.1,scaffold_200003.1  CP5,CP5,CP5
    # the gene_id and gene_name are repeated 3 times because three intervals
    # (i.e. 5262-5496, 5579-5750, 5856-6032) from data02 overlap (or touch)
    # the interval range from data01 (5207-5867)
    # So, whenever the ranges of the two data frames overlap, copy the gene_id and gene_name,
    # and simply put NA in gene_id and gene_name for non-overlapping ranges
    2       155667  155670  2816             NA                                                     NA
    2       67910   68022   2                NA                                                     NA
    2       68464   68483   3                NA                                                     NA
    2       525     775     132              scaffold_200001.1                                      NA
    2       118938  119559  1157             NA                                                     NA

+3

python merge pandas dataframe bioinformatics

everestial007 Apr 18 '17 at 14:45


3 answers

I understand that you are using Python, but your problem can be solved easily with the classic bioinformatics tool bedtools intersect: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

Both of your input files follow standard BED formats: http://bedtools.readthedocs.io/en/latest/content/general-usage.html

bedtools intersect gives you flexible options for defining what counts as an intersection or overlap between two regions. I believe it can also work directly on bgzipped input.

+4

Gordon bean Apr 18 '17 at 15:29


You could use interval trees in Python; they are efficient in both time and memory. I tried something similar and ran into a problem that was later solved. Here is the code I wrote: Using the interval tree to search overlapping areas

You can build on that code.
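The linked code is not reproduced on this page, so what follows is only a minimal sketch of the interval-tree idea and not the code the answer refers to. It is a small centered interval tree written with the standard library only; the names `Interval`, `IntervalNode` and `overlapping` are made up for illustration, intervals are assumed to be closed (touching counts as overlap), and contig is assumed to be the same for every row, as in the sample data.

```python
from collections import namedtuple

# Hypothetical record type: a closed interval [start, end] plus a payload.
Interval = namedtuple("Interval", "start end data")


class IntervalNode:
    """One node of a centered interval tree built from a non-empty list."""

    def __init__(self, intervals):
        # Use the median endpoint as the center, so at least one interval
        # contains it and both subtrees are strictly smaller.
        points = sorted(p for iv in intervals for p in (iv.start, iv.end))
        self.center = points[len(points) // 2]
        # Intervals that contain the center stay in this node.
        self.here = [iv for iv in intervals
                     if iv.start <= self.center <= iv.end]
        left = [iv for iv in intervals if iv.end < self.center]
        right = [iv for iv in intervals if iv.start > self.center]
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def overlapping(self, start, end):
        """Return all stored intervals that overlap the closed range."""
        hits = [iv for iv in self.here
                if iv.start <= end and iv.end >= start]
        # Only descend into a subtree that can still contain an overlap.
        if self.left is not None and start < self.center:
            hits += self.left.overlapping(start, end)
        if self.right is not None and end > self.center:
            hits += self.right.overlapping(start, end)
        return hits


# The exons from data02: (start, last) with (gene_id, gene_name) as payload.
exons = [
    Interval(5262, 5496, ("scaffold_200003.1", "CP5")),
    Interval(5579, 5750, ("scaffold_200003.1", "CP5")),
    Interval(5856, 6032, ("scaffold_200003.1", "CP5")),
    Interval(6115, 6198, ("scaffold_200003.1", "CP5")),
    Interval(916, 1201, ("scaffold_200001.1", "NA")),
    Interval(614, 789, ("scaffold_200001.1", "NA")),
    Interval(171, 435, ("scaffold_200001.1", "NA")),
    Interval(2677, 2806, ("scaffold_200002.1", "NA")),
    Interval(2899, 3125, ("scaffold_200002.1", "NA")),
]
tree = IntervalNode(exons)

# Query with a haplotype block from data01.
block_hits = tree.overlapping(5207, 5867)
```

The tree is built once from data02 and then queried once per row of data01. A production interval tree would also keep the intervals in each node sorted by endpoint; this sketch scans them linearly to stay short.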

+1

sbradbio Apr 18 '17 at 22:40


piRSquared · Accepted Answer · 2017-04-20T09:46:33+0000

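    import numpy as np
    import pandas as pd

    # interval bounds as NumPy arrays
    s1 = data01.start.values
    e1 = data01.end.values
    s2 = data02.start.values
    e2 = data02['last'].values

    # overlap[i, j] is True when row i of data01 overlaps row j of data02.
    # Two closed intervals overlap when each starts no later than the other ends;
    # this also catches a data02 interval that fully contains a data01 interval.
    # Note: contig is not compared here (the sample data has a single contig).
    overlap = (s1[:, None] <= e2) & (e1[:, None] >= s2)

    g = data02.gene_id.values
    n = data02.gene_name.values

    # i: positions of matching data01 rows, j: positions of matching data02 rows
    i, j = np.where(overlap)
    # map positional row numbers to the index labels of data01
    idx_map = {i_: data01.index[i_] for i_ in pd.unique(i)}

    def make_series(m):
        # join the matched values per data01 row; missing values become ''
        s = pd.Series(m[j]).fillna('').groupby(i).agg(','.join)
        # rename relabels the index; rows that joined to '' become NaN
        return s.rename(idx_map).replace('', np.nan)

    data01.assign(
        gene_id=make_series(g),
        gene_name=make_series(n),
    )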
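Since the question also asks for an explanation and mentions being open to a dictionary, here is a plain-Python sketch of the same pairwise overlap test, written as an illustration and not as part of the answer above. The function name `annotate` and the tuple layouts are assumptions made for this example; missing values are kept as the string "NA", and exons are grouped by contig in a dictionary so that only rows on the same contig are compared.

```python
# data01 rows: (contig, start, end, haplotype_block)
blocks = [
    (2, 5207, 5867, 1856),
    (2, 155667, 155670, 2816),
    (2, 67910, 68022, 2),
    (2, 68464, 68483, 3),
    (2, 525, 775, 132),
    (2, 118938, 119559, 1157),
]

# data02 rows: (contig, start, last, gene_id, gene_name)
exons = [
    (2, 5262, 5496, "scaffold_200003.1", "CP5"),
    (2, 5579, 5750, "scaffold_200003.1", "CP5"),
    (2, 5856, 6032, "scaffold_200003.1", "CP5"),
    (2, 6115, 6198, "scaffold_200003.1", "CP5"),
    (2, 916, 1201, "scaffold_200001.1", "NA"),
    (2, 614, 789, "scaffold_200001.1", "NA"),
    (2, 171, 435, "scaffold_200001.1", "NA"),
    (2, 2677, 2806, "scaffold_200002.1", "NA"),
    (2, 2899, 3125, "scaffold_200002.1", "NA"),
]


def annotate(blocks, exons):
    """Append comma-joined gene_id and gene_name to every block row."""
    # Group exons by contig so each block is compared only with exons
    # on the same contig.
    by_contig = {}
    for contig, start, last, gene_id, gene_name in exons:
        by_contig.setdefault(contig, []).append(
            (start, last, gene_id, gene_name))

    annotated = []
    for contig, start, end, block in blocks:
        # Closed intervals overlap when each one starts no later than
        # the other one ends (touching endpoints count as overlap).
        hits = [(gene_id, gene_name)
                for s, l, gene_id, gene_name in by_contig.get(contig, [])
                if start <= l and end >= s]
        gene_ids = ",".join(g for g, _ in hits) or "NA"
        gene_names = ",".join(n for _, n in hits) or "NA"
        annotated.append((contig, start, end, block, gene_ids, gene_names))
    return annotated


result = annotate(blocks, exons)
```

This compares every block with every exon on the same contig, which is the same amount of work as the broadcast comparison, only written as explicit loops. For large inputs an interval tree or bedtools is likely to scale better.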
