Pandas: delete rows based on other rows

Question

I have a pandas dataframe that looks like this:

    qseqid  sseqid  qstart  qend
         2       1     125   345
         4       1     150   320
         3       2     150   450
         6       2      25   300
         8       2      50   500

I would like to delete rows based on the values of other rows, using this criterion: a row (r1) should be deleted if another row (r2) exists with the same sseqid such that r1[qstart] > r2[qstart] and r1[qend] < r2[qend].

Is this possible with pandas?

python pandas dataframe

jsgounot Aug 30 '16 at 9:25

1 answer

unutbu · Accepted Answer · 2016-08-30T10:33:11+0000

    import pandas as pd

    df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                       'qseqid': [2, 4, 3, 6, 8],
                       'qstart': [125, 150, 150, 25, 50],
                       'sseqid': [1, 1, 2, 2, 2]})

    def remove_rows(df):
        # Pair every row with every row that shares its sseqid (self-join).
        merged = pd.merge(df.reset_index(), df, on='sseqid')
        # True where the left row lies strictly inside the right row's range.
        mask = ((merged['qstart_x'] > merged['qstart_y'])
                & (merged['qend_x'] < merged['qend_y']))
        # Keep only the index labels that were never flagged.
        df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
        result = df.loc[df_mask]
        return result

    result = remove_rows(df)
    print(result)

gives

       qend  qseqid  qstart  sseqid
    0   345       2     125       1
    3   300       6      25       2
    4   500       8      50       2

The idea is to use pd.merge to form a DataFrame with each pairing of rows with the same sseqid :

    In [78]: pd.merge(df.reset_index(), df, on='sseqid')
    Out[78]:
        index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
    0       0     345         2       125       1     345         2       125
    1       0     345         2       125       1     320         4       150
    2       1     320         4       150       1     345         2       125
    3       1     320         4       150       1     320         4       150
    4       2     450         3       150       2     450         3       150
    5       2     450         3       150       2     300         6        25
    6       2     450         3       150       2     500         8        50
    7       3     300         6        25       2     450         3       150
    8       3     300         6        25       2     300         6        25
    9       3     300         6        25       2     500         8        50
    10      4     500         8        50       2     450         3       150
    11      4     500         8        50       2     300         6        25
    12      4     500         8        50       2     500         8        50
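Because the merge pairs every row with every row in the same sseqid group (itself included), the merged frame has one row per ordered pair, so its size is the sum of the squared group sizes. The following is only a small sketch to check that against the example data; the names group_sizes and expected_pairs are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

# Number of rows per sseqid group: 2 rows for sseqid 1, 3 rows for sseqid 2.
group_sizes = df.groupby('sseqid').size()

# Each group of n rows contributes n * n ordered pairs to the self-merge.
expected_pairs = int((group_sizes ** 2).sum())

merged = pd.merge(df.reset_index(), df, on='sseqid')
print(expected_pairs, len(merged))  # prints: 13 13
```

For frames with very large groups, this quadratic growth is worth keeping in mind before running the merge.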

Each merged row contains data from two rows of df, so the two rows in each pair can be compared using

 mask = ((merged['qstart_x'] > merged['qstart_y']) & (merged['qend_x'] < merged['qend_y']))

and find the labels in df.index that are never flagged by this condition:

 df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)

and select those rows:

 result = df.loc[df_mask]

Note that this assumes df has a unique index.
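If the index is not unique, df.index.isin would drop every row that shares a label with a flagged row. One possible workaround, sketched here as an assumption rather than as part of the original answer, is to do the comparison on row positions instead of labels. The function name remove_rows_by_position and the frame dup are hypothetical, and the sketch assumes df has no column that is already named 'index'.

```python
import pandas as pd

def remove_rows_by_position(df):
    """Hypothetical variant of remove_rows that tolerates duplicate index labels."""
    # Swap the index for unique positions 0..n-1, for the comparison only.
    tmp = df.reset_index(drop=True)
    merged = pd.merge(tmp.reset_index(), tmp, on='sseqid')
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # Boolean array aligned with row positions, not with index labels.
    keep = ~tmp.index.isin(merged.loc[mask, 'index'])
    # Boolean-array selection keeps the original (possibly duplicated) labels.
    return df[keep]

dup = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                    'qseqid': [2, 4, 3, 6, 8],
                    'qstart': [125, 150, 150, 25, 50],
                    'sseqid': [1, 1, 2, 2, 2]},
                   index=[0, 0, 1, 1, 1])  # deliberately non-unique labels

result_dup = remove_rows_by_position(dup)
print(result_dup['qseqid'].tolist())  # prints: [2, 6, 8]
```

The rows that survive are the same as in the example above; only the bookkeeping changes from labels to positions.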
