Iterating over a pandas Series takes forever, but I can't think of a way to solve this problem without iterating. Is there a faster way?

I have a pandas Series of consecutive numbers, something like

import pandas as pd
D = pd.Series([2, 3, 4, 4, 5, 4, 3, 2, 3, 4, 5, 4, 3, 2, 1, 0],
    index=pd.date_range(start='2015-01-02 12:00:00', periods=16, freq='s'))
D

2015-01-02 12:00:00    2
2015-01-02 12:00:01    3
2015-01-02 12:00:02    4
2015-01-02 12:00:03    4
2015-01-02 12:00:04    5
2015-01-02 12:00:05    4
2015-01-02 12:00:06    3
2015-01-02 12:00:07    2
2015-01-02 12:00:08    3
2015-01-02 12:00:09    4
2015-01-02 12:00:10    5
2015-01-02 12:00:11    4
2015-01-02 12:00:12    3
2015-01-02 12:00:13    2
2015-01-02 12:00:14    1
2015-01-02 12:00:15    0
Freq: S, dtype: int64

However, my actual dataset may have a million rows. I am interested in answering the following question for each row:

For each row i and a fixed positive number s, let S be the next index after i such that D[S] <= D[i] - s. What is the maximum value of D[j] - D[i] for j between i and S?

In other words: starting at row i, how far does the series rise before it first drops to D[i] - s or below? In my case s = 2.
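
To make the definition concrete, here is a direct brute-force translation of it (the helper name peak_before_drop is only for illustration; D and s = 2 are as above):

def peak_before_drop(D, i, s):
    # scan forward from row i, tracking the largest rise D[j] - D[i],
    # and stop as soon as a value drops to D[i] - s or below
    base = D.iloc[i]
    best = 0
    for j in range(i + 1, len(D)):
        best = max(best, D.iloc[j] - base)
        if D.iloc[j] <= base - s:
            break
    return best

print([peak_before_drop(D, i, 2) for i in range(len(D))])

The question is how to get the same result without a pure-Python scan for every row.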

My current approach iterates over the series. Working with the element-to-element differences of D (Ddiff below), I do the following:

Ddiff = D.diff()

def pmax(Ddiff, i, s):
    # walk through the differences after row i, accumulating the running
    # change relative to D[i] and remembering the largest value seen;
    # stop once the running change has dropped to the threshold s
    values = Ddiff.iloc[i+1:].items()
    pmaxvalue = 0
    pcurrentvalue = 0
    while pcurrentvalue > s:
        try:
            pcurrentvalue += next(values)[1]
        except StopIteration:
            return pmaxvalue
        pmaxvalue = max(pmaxvalue, pcurrentvalue)
    return pmaxvalue

peaks = []
for i in range(len(D)):
    # s is passed as -2 here, i.e. stop once the series has fallen 2 below D[i]
    peaks.append(pmax(Ddiff, i, -2))

This gives the result I want, but it is very slow.

For my real data with around a million rows it takes far too long.

Is there a faster way to do this that avoids iterating over the series row by row?


For reference, here are timings of your approach for different series lengths n:

  • n = 10 → 1.3 ms
  • n = 100 → 31.5 ms
  • n = 1000 → 2160 ms
  • n = 5000 → 52500 ms

Using numpy you can vectorize the inner search by precomputing the differences and their cumulative sum (timings for the same series lengths follow the code):

import numpy as np
import pandas as pd

# create a big test series
n = 1000
D = pd.Series(np.random.randint(0, 6, n),
    index=pd.date_range(start='2015-01-02 12:00:00', periods=n, freq='s'))
# compute element-to-element differences
Ddiff = D.diff()
# compute the cumulative sum of the differences, so Dsum[j] - Dsum[i] == D[j] - D[i]
Dsum = Ddiff.cumsum()
# set the initial value from NaN to 0
Dsum.iloc[0] = 0

def pmax2(Dsum, D, i, s):
    # position (relative to i+1) of the first value <= D[i] - s;
    # note that np.argmax returns 0 if the condition is never met
    S = np.argmax(D.iloc[i+1:] <= D.iloc[i] - s)
    # maximum rise relative to D[i] up to that point
    l = np.max(Dsum.iloc[i:i+S+1] - Dsum.iloc[i])
    return l

# compute the peak for every entry (the last row trivially gets 0)
peaks = []
for i in range(len(D) - 1):
    peaks.append(pmax2(Dsum, D, i, 2))
peaks.append(0)

  • n = 10 → 1.79 ms
  • n = 100 → 19.7 ms
  • n = 1000 → 200 ms
  • n = 5000 → 1040 ms
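
As a quick sanity check, pmax2 can also be applied to the 16-element example series from the question (a minimal sketch; the names D_small and Dsum_small are only used here):

# rebuild the small example series and its cumulative differences
D_small = pd.Series([2, 3, 4, 4, 5, 4, 3, 2, 3, 4, 5, 4, 3, 2, 1, 0],
    index=pd.date_range(start='2015-01-02 12:00:00', periods=16, freq='s'))
Dsum_small = D_small.diff().cumsum()
Dsum_small.iloc[0] = 0
# the last row has no following entries, so its peak is 0 by definition
print([pmax2(Dsum_small, D_small, i, 2) for i in range(len(D_small) - 1)] + [0])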

This is much faster, but it still iterates over all rows with a Python for loop.
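
If that remaining loop is still too slow for a million rows, one option (not covered above; a minimal sketch that assumes numba is installed) is to compile the same forward scan with numba, so the per-row loop runs in machine code instead of the Python interpreter:

import numba
import numpy as np

@numba.njit
def peaks_numba(values, s):
    # for each i, scan forward tracking the largest rise values[j] - values[i],
    # stopping at the first j where values[j] <= values[i] - s
    n = len(values)
    out = np.zeros(n)
    for i in range(n):
        best = 0.0
        for j in range(i + 1, n):
            rise = values[j] - values[i]
            if rise > best:
                best = rise
            if values[j] <= values[i] - s:
                break
        out[i] = best
    return out

peaks = peaks_numba(D.to_numpy(dtype=np.float64), 2)

The algorithm is the same forward scan as before; only the interpreter overhead per element is removed.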
