Slice pandas df based on n consecutive element instances

Question

Slice pandas df based on n consecutive element instances

I would like to slice a dataframe to return rows where the element x = 0 appears sequentially at least n = 3 times, and then discarding the first i = 2 instance in each mini-sequence

is there an effective way to achieve in pandas, and if not, using numpy or scipy?

import pandas as pd import numpy as np

Example 1

 df=pd.DataFrame({'A':[0,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1],'B':np.random.randn(17)}) AB 0 0 0.748958 1 1 0.254730 2 0 0.629609 3 0 0.272738 4 1 -1.885906 5 1 1.206371 6 0 -0.332471 7 0 0.217553 8 0 0.768986 9 0 -1.607236 10 1 1.613650 11 1 -1.096892 12 0 -0.435762 13 0 0.131284 14 0 -0.177188 15 1 1.393890 16 1 0.174803

Required Conclusion:

  AB 8 0 0.768986 9 0 -1.607236 14 0 -0.177188

Example 2

x = 0 (element of interest)

n = 5 (minimum sequence length)

i = 2 (discard first two in each sequence)

 df2=pd.DataFrame({'A':[0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0],'B':np.random.randn(20)}) AB 0 0 0.703803 1 0 -0.144088 2 0 0.635577 3 0 -0.834611 4 0 1.472271 5 0 -0.554860 6 0 -0.167016 7 1 0.578847 8 1 -1.873663 9 0 0.197062 10 0 1.458845 11 0 -1.921660 12 0 -1.301481 13 0 0.240197 14 0 -1.425058 15 1 -2.801151 16 0 0.766757 17 0 1.249806 18 0 0.595366 19 0 -1.447632

Required Conclusion:

  AB 2 0 0.635577 3 0 -0.834611 4 0 1.472271 5 0 -0.554860 6 0 -0.167016 11 0 -1.921660 12 0 -1.301481 13 0 0.240197 14 0 -1.425058

+6

python numpy scipy pandas

themachinist Oct 11 '15 at 10:03

source share

1 answer

Divakar · Accepted Answer · 2015-10-11T10:37:28+0000

Here's an approach using some NumPy manipulation -

 def slice_consc(df,n): Acol = np.array(df['A']) Acol_ext = np.concatenate(([0],(Acol==0)+0,[0])) starts = np.where(np.diff(Acol_ext)==1)[0] stops = np.where(np.diff(Acol_ext)==-1)[0] id = np.zeros(Acol.size+2,dtype=int) valid_mask = stops - starts >= n id[stops[valid_mask]] = -1 id[starts[valid_mask]+2] = 1 return df[(id.cumsum()==1)[:-2]]

Run Examples -

Case No. 1:

 >>> df AB 0 0 0.977325 1 1 -0.408457 2 0 -0.377973 3 0 0.567537 4 1 -0.222019 5 1 -1.167422 6 0 -0.142546 7 0 0.675458 8 0 -0.184456 9 0 -0.826050 10 1 -0.772413 11 1 -1.556440 12 0 -0.687249 13 0 -0.481676 14 0 0.420400 15 1 0.031999 16 1 -1.092540 >>> slice_consc(df,3) AB 8 0 -0.184456 9 0 -0.826050 14 0 0.420400

Case No. 2:

 >>> df2 AB 0 0 0.757102 1 0 2.114935 2 0 -0.352309 3 0 -0.214931 4 0 -1.626064 5 0 -0.989776 6 0 0.639635 7 1 0.049358 8 1 -2.600326 9 0 0.057792 10 0 1.263418 11 0 0.618495 12 0 -1.637054 13 0 1.220862 14 0 1.245484 15 1 1.388218 16 0 -0.499900 17 0 0.761310 18 0 -1.308176 19 0 -2.005983 >>> slice_consc(df2,5) AB 2 0 -0.352309 3 0 -0.214931 4 0 -1.626064 5 0 -0.989776 6 0 0.639635 11 0 0.618495 12 0 -1.637054 13 0 1.220862 14 0 1.245484

Slice pandas df based on n consecutive element instances

More articles: