Pandas: delete rows based on other rows

I have a pandas dataframe that looks like this:

    qseqid  sseqid  qstart  qend
    2       1       125     345
    4       1       150     320
    3       2       150     450
    6       2       25      300
    8       2       50      500

I would like to delete rows based on the values of other rows, using this criterion: a row (r1) should be deleted if another row (r2) exists with the same sseqid and r1[qstart] > r2[qstart] and r1[qend] < r2[qend].

Is this possible with pandas?

1 answer
    import pandas as pd

    df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                       'qseqid': [2, 4, 3, 6, 8],
                       'qstart': [125, 150, 150, 25, 50],
                       'sseqid': [1, 1, 2, 2, 2]})

    def remove_rows(df):
        # Pair every row with every row that shares its sseqid
        merged = pd.merge(df.reset_index(), df, on='sseqid')
        # True where the left row (_x) lies strictly inside the right row (_y)
        mask = ((merged['qstart_x'] > merged['qstart_y'])
                & (merged['qend_x'] < merged['qend_y']))
        # Keep only the index labels that never satisfied the condition
        df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
        result = df.loc[df_mask]
        return result

    result = remove_rows(df)
    print(result)

gives

       qend  qseqid  qstart  sseqid
    0   345       2     125       1
    3   300       6      25       2
    4   500       8      50       2

The idea is to use pd.merge to form a DataFrame containing every pairing of rows that share the same sseqid (including each row paired with itself):

    In [78]: pd.merge(df.reset_index(), df, on='sseqid')
    Out[78]:
        index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
    0       0     345         2       125       1     345         2       125
    1       0     345         2       125       1     320         4       150
    2       1     320         4       150       1     345         2       125
    3       1     320         4       150       1     320         4       150
    4       2     450         3       150       2     450         3       150
    5       2     450         3       150       2     300         6        25
    6       2     450         3       150       2     500         8        50
    7       3     300         6        25       2     450         3       150
    8       3     300         6        25       2     300         6        25
    9       3     300         6        25       2     500         8        50
    10      4     500         8        50       2     450         3       150
    11      4     500         8        50       2     300         6        25
    12      4     500         8        50       2     500         8        50
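As a quick sanity check on the merge above, the number of merged rows should equal the sum of the squared group sizes, since each sseqid group of n rows contributes n × n pairings. The sketch below rebuilds the example df and compares the two counts; the variable names group_sizes and expected_pairs are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

# Self-merge on sseqid: every row is paired with every row of its group
merged = pd.merge(df.reset_index(), df, on='sseqid')

# Group sizes are 2 (sseqid 1) and 3 (sseqid 2), so 2*2 + 3*3 pairings
group_sizes = df.groupby('sseqid').size()
expected_pairs = int((group_sizes ** 2).sum())

print(len(merged), expected_pairs)  # 13 13
```

Because the pairing count grows quadratically with group size, this approach may need a lot of memory when a single sseqid has many rows.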

Each row of the merged DataFrame contains data from two rows of df: the _x columns come from one row and the _y columns from the other. You can then compare each pair of rows using

    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))

and find the labels in df.index that do not meet this condition:

 df_mask = ~df.index.isin(merged.loc[mask, 'index'].values) 

and select these rows:

 result = df.loc[df_mask] 

Note that this assumes df has a unique index.
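If the index is not unique, df.index.isin would also drop every other row that shares a label with a removed row. One possible workaround, sketched here under the assumption that identifying rows by position is acceptable, is to run the same merge on a copy with a fresh positional index and then filter the original frame with the resulting boolean array. The helper name remove_rows_any_index is hypothetical and not part of the answer's code.

```python
import pandas as pd

def remove_rows_any_index(df):
    # Hypothetical variant of remove_rows: identify rows by position,
    # so duplicate index labels in df do not matter.
    tagged = df.reset_index(drop=True)
    merged = pd.merge(tagged.reset_index(), tagged, on='sseqid')
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # Positions of rows strictly contained in another row of the same sseqid
    contained = merged.loc[mask, 'index'].unique()
    keep = ~tagged.index.isin(contained)
    # A boolean NumPy array selects by position and keeps the original labels
    return df[keep]

# Same data as above, but with deliberately duplicated index labels
df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]},
                  index=['a', 'a', 'b', 'b', 'b'])

result = remove_rows_any_index(df)
print(result['qseqid'].tolist())  # [2, 6, 8]
```

The surviving rows keep their original labels ('a', 'b', 'b'), so nothing about the caller's index is changed.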
