Firstly, I'm sorry if this is a little long, but I wanted to fully describe what I'm having problems with and what I have already tried.
I am trying to combine (merge) two DataFrame objects under several conditions. I know how to do this if the conditions that must be met are all "equals" operators; however, I need to use less-than and greater-than comparisons.
The data is genetic information: one DataFrame is a list of mutations in the genome (called SNPs), and the other contains information about the locations of genes in the human genome. Running df.head() on them returns the following:
SNP DataFrame (snp_df):
       chromosome        SNP      BP
    0           1  rs3094315  752566
    1           1  rs3131972  752721
    2           1  rs2073814  753474
    3           1  rs3115859  754503
    4           1  rs3131956  758144
This shows the SNP reference identifiers and their locations. "BP" stands for the base-pair position.
Gene DataFrame (gene_df):
       chromosome  chr_start  chr_stop        feature_id
    0           1      10954     11507  GeneID:100506145
    1           1      12190     13639  GeneID:100652771
    2           1      14362     29370     GeneID:653635
    3           1      30366     30503  GeneID:100302278
    4           1      34611     36081     GeneID:645520
This DataFrame shows the locations of all the genes of interest.
What I want to find is all the SNPs that fall within gene regions in the genome, discarding those that fall outside these regions.
If I wanted to combine two data frames based on multiple (equal) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this case I need to find the SNPs where the chromosome value matches the chromosome value in the gene DataFrame, and the BP value falls between "chr_start" and "chr_stop". What makes this difficult is that these DataFrames are very large. In the current dataset, snp_df has 6795021 rows and gene_df has 34362.
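To make the condition concrete, here is a minimal sketch on toy data (the names snp_toy and gene_toy and all values in them are made up for illustration, not taken from my dataset). It merges on chromosome alone and then filters on BP. This gives the result I am after, but the intermediate frame holds every SNP-gene pair per chromosome, so it is likely far too large to build for the full data:

```python
import pandas as pd

# Toy data; all values are invented for illustration only.
snp_toy = pd.DataFrame({
    "chromosome": [1, 1, 1, 2],
    "SNP": ["rsA", "rsB", "rsC", "rsD"],
    "BP": [100, 250, 900, 150],
})
gene_toy = pd.DataFrame({
    "chromosome": [1, 1, 2],
    "chr_start": [50, 200, 500],
    "chr_stop": [120, 300, 600],
    "feature_id": ["GeneID:1", "GeneID:2", "GeneID:3"],
})

# Pair every SNP with every gene on the same chromosome.
pairs = snp_toy.merge(gene_toy, on="chromosome")

# Keep only pairs where chr_start <= BP <= chr_stop (inclusive on both ends).
in_gene = pairs["BP"].between(pairs["chr_start"], pairs["chr_stop"])
genic_toy = pairs.loc[in_gene, ["SNP", "feature_id"]].reset_index(drop=True)
print(genic_toy)
```

Here rsA and rsB each land inside one gene, while rsC and rsD fall outside every gene on their chromosome and are dropped.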
I have tried to deal with this by looking at either the chromosomes or the genes separately. There are 22 different chromosome values (ints 1-22), since the sex chromosomes are not used. Both methods take a very long time. One uses the pandasql module, while the other loops through the individual genes.
SQL Method
    import pandas as pd
    import pandasql as psql

    pysqldf = lambda q: psql.sqldf(q, globals())

    q = """
    SELECT s.SNP, g.feature_id
    FROM this_snp s INNER JOIN this_genes g
    WHERE s.BP >= g.chr_start
    AND s.BP <= g.chr_stop;
    """

    all_dfs = []
    for chromosome in snp_df['chromosome'].unique():
        this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
        this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
        genic_snps = pysqldf(q)
        all_dfs.append(genic_snps)

    all_genic_snps = pd.concat(all_dfs)
Gene Iteration Method
    all_dfs = []
    for line in gene_df.iterrows():
        info = line[1]  # Getting the Series object
        this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                              (snp_df['BP'] >= info['chr_start']) &
                              (snp_df['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

    all_genic_snps = pd.concat(all_dfs)
Can anyone suggest a more efficient way to do this?