import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

def remove_rows(df):
    # Pair every row with every row that shares its sseqid
    merged = pd.merge(df.reset_index(), df, on='sseqid')
    # True where row x lies strictly inside row y
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # Keep only the rows whose label was never flagged
    df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
    result = df.loc[df_mask]
    return result

result = remove_rows(df)
print(result)
gives
   qend  qseqid  qstart  sseqid
0   345       2     125       1
3   300       6      25       2
4   500       8      50       2
The idea is to use pd.merge to form a DataFrame with each pairing of rows with the same sseqid:
In [78]: pd.merge(df.reset_index(), df, on='sseqid')
Out[78]:
    index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
0       0     345         2       125       1     345         2       125
1       0     345         2       125       1     320         4       150
2       1     320         4       150       1     345         2       125
3       1     320         4       150       1     320         4       150
4       2     450         3       150       2     450         3       150
5       2     450         3       150       2     300         6        25
6       2     450         3       150       2     500         8        50
7       3     300         6        25       2     450         3       150
8       3     300         6        25       2     300         6        25
9       3     300         6        25       2     500         8        50
10      4     500         8        50       2     450         3       150
11      4     500         8        50       2     300         6        25
12      4     500         8        50       2     500         8        50
Each row of the merged DataFrame holds data from two rows of df. You can then test, for every pair, whether the first row's range lies strictly inside the second's:
mask = ((merged['qstart_x'] > merged['qstart_y']) & (merged['qend_x'] < merged['qend_y']))
and find the labels in df.index that do not meet this condition:
df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
and select these rows:
result = df.loc[df_mask]
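To see what the mask actually flags, the steps above can be traced on the sample data. This is only an illustrative sketch; the names flagged and contained_labels are introduced here and are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

merged = pd.merge(df.reset_index(), df, on='sseqid')
mask = ((merged['qstart_x'] > merged['qstart_y'])
        & (merged['qend_x'] < merged['qend_y']))

# Pairs in which row "x" lies strictly inside row "y"
# (flagged is a hypothetical name used only for this illustration).
flagged = merged.loc[mask, ['index', 'qseqid_x', 'qseqid_y']]
print(flagged)

# Labels of the contained rows in the original df; these are the rows to drop.
contained_labels = sorted(flagged['index'].unique().tolist())
print(contained_labels)  # [1, 2]
```

Rows 1 and 2 of df are each contained in another row with the same sseqid (rows 0 and 4 respectively), which is why only rows 0, 3 and 4 survive.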
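Note that this assumes df has a unique index, because rows are dropped by index label: with duplicate labels, every row sharing a flagged label would be removed.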
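If the index might contain duplicates, one possible workaround is to do the bookkeeping on row positions instead of labels. The following is a minimal sketch under that assumption; remove_rows_by_position is a hypothetical helper name, and it also assumes df has no column already called 'index'.

```python
import pandas as pd

def remove_rows_by_position(df):
    # Hypothetical variant: work on a copy with a fresh 0..n-1 index,
    # so the 'index' column in merged holds row positions, not labels.
    tmp = df.reset_index(drop=True)
    merged = pd.merge(tmp.reset_index(), tmp, on='sseqid')
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # Boolean array aligned with the row order of df
    keep = ~tmp.index.isin(merged.loc[mask, 'index'])
    return df.loc[keep]

# Same data as above, but with a deliberately non-unique index
dup = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                    'qseqid': [2, 4, 3, 6, 8],
                    'qstart': [125, 150, 150, 25, 50],
                    'sseqid': [1, 1, 2, 2, 2]},
                   index=['a', 'a', 'b', 'b', 'b'])

result_dup = remove_rows_by_position(dup)
print(result_dup)
```

Selecting with a boolean array keeps the original labels on the surviving rows, so the caller's index is left as it was.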