Combining DataFrames on several conditions - not only on equal values

Firstly, I'm sorry if this is a little long, but I wanted to fully describe what I'm having problems with and what I have already tried.

I am trying to combine (merge) two DataFrame objects on several conditions. I know how to do this when all of the conditions are "equals" comparisons, but I need to use "less than" and "greater than".

The data is genetic information: one frame is a list of mutations in the genome (called SNPs), and the other contains the locations of genes in the human genome. Running df.head() on them returns the following:

SNP DataFrame (snp_df):

       chromosome        SNP      BP
    0           1  rs3094315  752566
    1           1  rs3131972  752721
    2           1  rs2073814  753474
    3           1  rs3115859  754503
    4           1  rs3131956  758144

This shows each SNP's reference identifier and its location. "BP" stands for the "Base-Pair" position.

Gene DataFrame (gene_df):

       chromosome  chr_start  chr_stop        feature_id
    0           1      10954     11507  GeneID:100506145
    1           1      12190     13639  GeneID:100652771
    2           1      14362     29370     GeneID:653635
    3           1      30366     30503  GeneID:100302278
    4           1      34611     36081     GeneID:645520

This DataFrame gives the locations of all the genes of interest.

I want to find all the SNPs that fall within a gene and drop those that lie outside these regions.

If I wanted to combine two data frames based on multiple (equal) conditions, I would do something like the following:

 merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns']) 

However, in this case I need to find the SNPs where the chromosome value matches the one in the gene DataFrame and the BP value lies between "chr_start" and "chr_stop". What makes this difficult is that the data is very large: in the current dataset, snp_df has 6795021 rows and gene_df has 34362.

I have tried to tackle this by looking at either the chromosomes or the genes separately. There are 22 different chromosome values (ints 1-22), since the sex chromosomes are not used. Both methods take a very long time. One uses the pandasql module, while the other loops through the individual genes.

SQL Method

    import pandas as pd
    import pandasql as psql

    pysqldf = lambda q: psql.sqldf(q, globals())

    q = """
    SELECT s.SNP, g.feature_id
    FROM this_snp s INNER JOIN this_genes g
    WHERE s.BP >= g.chr_start
    AND s.BP <= g.chr_stop;
    """

    all_dfs = []
    for chromosome in snp_df['chromosome'].unique():
        this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
        this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
        genic_snps = pysqldf(q)
        all_dfs.append(genic_snps)

    all_genic_snps = pd.concat(all_dfs)

Gene Iteration Method

    all_dfs = []
    for line in gene_df.iterrows():
        info = line[1]  # Getting the Series object
        this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                              (snp_df['BP'] >= info['chr_start']) &
                              (snp_df['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

    all_genic_snps = pd.concat(all_dfs)

Can anyone suggest an efficient way to do this?

2 answers

I just thought of a way to solve this by combining my two methods:

Focus on the individual chromosomes first, and then loop through the genes within these smaller DataFrames. This does not need any SQL queries. I also included a step that immediately discards redundant genes, meaning genes whose range cannot contain any SNP. This uses a double for loop, which I usually try to avoid, but in this case it works quite well.

    all_dfs = []
    for chromosome in snp_df['chromosome'].unique():
        this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
        this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]

        # Getting rid of redundant genes: keep only genes that overlap the
        # [min_bp, max_bp] span of this chromosome's SNPs (bounds inclusive,
        # matching the >= / <= test in the inner loop)
        min_bp = this_chr_snp['BP'].min()
        max_bp = this_chr_snp['BP'].max()
        this_genes = this_genes.loc[(this_genes['chr_start'] <= max_bp) &
                                    (this_genes['chr_stop'] >= min_bp)]

        for line in this_genes.iterrows():
            info = line[1]
            this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                        (this_chr_snp['BP'] <= info['chr_stop'])]
            if this_snp.shape[0] != 0:
                this_snp = this_snp[['SNP']]
                this_snp.insert(1, 'feature_id', info['feature_id'])
                all_dfs.append(this_snp)

    all_genic_snps = pd.concat(all_dfs)

This is not spectacularly fast, but it works, so I can get some answers. I would still like to know if anyone has tips for making it more efficient.
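One possible direction for the efficiency question, offered only as a sketch: within each chromosome, sort the SNP positions once and use a binary search (np.searchsorted) to find, for every gene, the slice of SNPs lying in [chr_start, chr_stop]. This replaces the per-gene boolean mask over all SNPs. The function name snps_in_genes and the small demo frames are illustrative assumptions, and the code assumes chr_start <= chr_stop and the column names shown in the question.

```python
import numpy as np
import pandas as pd


def snps_in_genes(snp_df, gene_df):
    """Return (SNP, feature_id) pairs with chr_start <= BP <= chr_stop
    on the same chromosome. Hypothetical helper, not from the question."""
    genes_by_chr = {key: grp for key, grp in gene_df.groupby("chromosome")}
    pieces = []
    for chromosome, snps in snp_df.groupby("chromosome"):
        genes = genes_by_chr.get(chromosome)
        if genes is None:
            continue
        snps = snps.sort_values("BP")
        bp = snps["BP"].to_numpy()
        # For each gene, positions lo..hi-1 in the sorted array are inside it.
        lo = np.searchsorted(bp, genes["chr_start"].to_numpy(), side="left")
        hi = np.searchsorted(bp, genes["chr_stop"].to_numpy(), side="right")
        counts = np.maximum(hi - lo, 0)
        total = int(counts.sum())
        if total == 0:
            continue
        # Expand every [lo, hi) slice into explicit row positions
        # without a Python-level loop over genes.
        group_start = np.cumsum(counts) - counts
        snp_pos = np.arange(total) + np.repeat(lo - group_start, counts)
        pieces.append(pd.DataFrame({
            "SNP": snps["SNP"].to_numpy()[snp_pos],
            "feature_id": np.repeat(genes["feature_id"].to_numpy(), counts),
        }))
    if not pieces:
        return pd.DataFrame(columns=["SNP", "feature_id"])
    return pd.concat(pieces, ignore_index=True)


# Illustrative data: the question's frames, with one BP changed so that
# a SNP actually falls inside a gene.
demo_snps = pd.DataFrame({
    "chromosome": [1, 1, 1],
    "SNP": ["rs3094315", "rs3131972", "rs2073814"],
    "BP": [752566, 30400, 753474],
})
demo_genes = pd.DataFrame({
    "chromosome": [1, 1, 1],
    "chr_start": [14362, 30366, 34611],
    "chr_stop": [29370, 30503, 36081],
    "feature_id": ["GeneID:653635", "GeneID:100302278", "GeneID:645520"],
})
result = snps_in_genes(demo_snps, demo_genes)
print(result)
```

A SNP that lies in several overlapping genes appears once per gene, which matches what the loop-based version produces. I have not timed this on the full dataset.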


You can use the following to accomplish what you are looking for:

    merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
    merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                          (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]

Note: none of the SNPs in your sample frames fall inside any of the sample genes, so the result would be empty. Here is an example using modified data frames:

    snp_df
    Out[193]:
       chromosome        SNP      BP
    0           1  rs3094315  752566
    1           1  rs3131972   30400
    2           1  rs2073814  753474
    3           1  rs3115859  754503
    4           1  rs3131956  758144

    gene_df
    Out[194]:
       chromosome  chr_start  chr_stop        feature_id
    0           1      10954     11507  GeneID:100506145
    1           1      12190     13639  GeneID:100652771
    2           1      14362     29370     GeneID:653635
    3           1      30366     30503  GeneID:100302278
    4           1      34611     36081     GeneID:645520

    merged_df
    Out[195]:
             SNP        feature_id
    8  rs3131972  GeneID:100302278
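One caveat with merging on chromosome alone: the intermediate frame pairs every SNP with every gene on the same chromosome before the filter runs, which can get very large for frames of the size described in the question. A hedged sketch of one way to bound that is to run the same merge-then-filter per chromosome and per chunk of SNPs. The function name merge_filter_chunked and the chunk_size parameter are assumptions for illustration, and the frames are the modified sample data from above.

```python
import pandas as pd

snp_df = pd.DataFrame({
    "chromosome": [1, 1, 1, 1, 1],
    "SNP": ["rs3094315", "rs3131972", "rs2073814", "rs3115859", "rs3131956"],
    "BP": [752566, 30400, 753474, 754503, 758144],
})
gene_df = pd.DataFrame({
    "chromosome": [1, 1, 1, 1, 1],
    "chr_start": [10954, 12190, 14362, 30366, 34611],
    "chr_stop": [11507, 13639, 29370, 30503, 36081],
    "feature_id": ["GeneID:100506145", "GeneID:100652771", "GeneID:653635",
                   "GeneID:100302278", "GeneID:645520"],
})


def merge_filter_chunked(snp_df, gene_df, chunk_size=10000):
    """Merge-then-filter, applied to one chunk of SNPs at a time so the
    intermediate frame holds at most chunk_size * (genes on that chromosome)
    rows. Hypothetical helper; chunk_size is an arbitrary default."""
    pieces = []
    for chromosome, snps in snp_df.groupby("chromosome"):
        genes = gene_df[gene_df["chromosome"] == chromosome]
        for start in range(0, len(snps), chunk_size):
            chunk = snps.iloc[start:start + chunk_size]
            merged = chunk.merge(genes, on="chromosome", how="inner")
            # Inclusive on both ends, same as BP >= chr_start and BP <= chr_stop.
            inside = merged["BP"].between(merged["chr_start"], merged["chr_stop"])
            pieces.append(merged.loc[inside, ["SNP", "feature_id"]])
    if not pieces:
        return pd.DataFrame(columns=["SNP", "feature_id"])
    return pd.concat(pieces, ignore_index=True)


genic = merge_filter_chunked(snp_df, gene_df, chunk_size=2)
print(genic)
```

The result has the same rows as the single merge; only the row index differs because the pieces are concatenated with a fresh index.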