Pandas: how to group consecutive values in a Series whose differences are within a given distance

I have a pandas Series of ints:

    import numpy as np
    import pandas as pd

    a = np.array([1,2,3,5,7,10,13,16,20])
    pd.Series(a)
    0     1
    1     2
    2     3
    3     5
    4     7
    5    10
    6    13
    7    16
    8    20

Now I want to split the series into groups such that, within each group, the difference between any two adjacent values is <= distance. For example, if the distance is 1, we have

 [1,2,3], [5], [7], [10], [13], [16], [20] 

If the distance is 2, we have

 [1,2,3,5,7], [10], [13], [16], [20] 

If the distance is 3, we have

 [1,2,3,5,7,10,13,16], [20] 

How can I do this using pandas / numpy?

python numpy pandas
2 answers

Here's one approach -

 np.split(a,np.flatnonzero(np.diff(a)>d)+1) 
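To make the one-liner easier to follow, here is a minimal sketch that breaks it into steps for d = 2. The intermediate names gaps, cut_points and groups are only illustrative and do not appear in the original answer.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2  # maximum allowed difference between neighbours in a group

# Differences between adjacent elements (one fewer than len(a)).
gaps = np.diff(a)

# Positions where the gap exceeds d; adding 1 gives the index of the
# first element of each new group.
cut_points = np.flatnonzero(gaps > d) + 1

# Split the array at those indices.
groups = np.split(a, cut_points)
print([g.tolist() for g in groups])
```

Each entry of cut_points is the index at which a new group starts, which is why 1 is added to the positions found in the difference array.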

As a function that returns a list of lists -

    def splitme(a,d):
        return list(map(list,np.split(a,np.flatnonzero(np.diff(a)>d)+1)))

For performance, I would suggest building the start and stop indices of each group, pairing them with zip, and slicing a plain list, thereby avoiding np.split, which could be a bottleneck -

    def splitme_zip(a,d):
        m = np.concatenate(([True],a[1:] > a[:-1] + d,[True]))
        idx = np.flatnonzero(m)
        l = a.tolist()
        return [l[i:j] for i,j in zip(idx[:-1],idx[1:])]

If you need the output as a list of arrays, skip the list conversion (the .tolist() call / map(list, ...)).
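As a rough sketch of that list-of-arrays variant, the zip-based function could slice the array directly instead of a converted list. The name splitme_zip_arrays is hypothetical, and the np.asarray call is an added convenience that is not in the original answer.

```python
import numpy as np

def splitme_zip_arrays(a, d):
    """Hypothetical variant of splitme_zip that returns a list of arrays."""
    a = np.asarray(a)
    # True marks the start of each group; the trailing True closes the last one.
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    # Slice the array itself rather than a Python list.
    return [a[i:j] for i, j in zip(idx[:-1], idx[1:])]

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
print(splitme_zip_arrays(a, 3))
```

Note that basic slicing of a NumPy array returns views, so the pieces share memory with a; copy them if they need to be modified independently.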

Sample runs -

    In [122]: a = np.array([1,2,3,5,7,10,13,16,20])

    In [123]: splitme(a,1)
    Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]

    In [124]: splitme(a,2)
    Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]

    In [125]: splitme(a,3)
    Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]

Runtime Test -

    In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))

    In [181]: s = pd.Series(a)

    In [182]: d = 3

    In [183]: %timeit pandas_way(s,d) #@cᴏʟᴅsᴘᴇᴇᴅ soln
    10 loops, best of 3: 55.1 ms per loop

    In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
         ...: %timeit splitme(a,d)
         ...: %timeit splitme_zip(a,d)
    1000 loops, best of 3: 1.47 ms per loop
    100 loops, best of 3: 2.87 ms per loop
    1000 loops, best of 3: 516 µs per loop

    In [185]: a
    Out[185]: array([    2,     2,     2, ..., 19992, 19996, 19999])

This is the pandas way, using groupby.

    n = 1

    s
    0     1
    1     2
    2     3
    3     5
    4     7
    5    10
    6    13
    7    16
    8    20
    dtype: int64

    m = ~s.diff().fillna(0).le(n)
    v = s.groupby(m.cumsum()).apply(lambda x: x.tolist()).tolist()

    v
    [[1, 2, 3], [5], [7], [10], [13], [16], [20]]
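The runtime test in the other answer calls pandas_way(s, d), but no definition of it appears on this page. A plausible sketch, assuming it simply wraps the two lines above in a function, might look like this:

```python
import numpy as np
import pandas as pd

def pandas_way(s, d):
    """Assumed wrapper around the groupby snippet; not shown in the original."""
    # True wherever the gap to the previous value exceeds d, i.e. a new group starts.
    m = ~s.diff().fillna(0).le(d)
    # The cumulative sum of the mask gives each element a group label.
    return s.groupby(m.cumsum()).apply(lambda x: x.tolist()).tolist()

s = pd.Series(np.array([1, 2, 3, 5, 7, 10, 13, 16, 20]))
print(pandas_way(s, 2))
```

The cumulative sum increases by one at every gap larger than d, so elements between two such gaps share a label and end up in the same group.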