Iterating over rows in a pandas DataFrame more cleanly than with .iterrows(), and tracking rows between specific values

I have a pandas DataFrame in Python 2.7 and I want to iterate over its rows to get the time between two events of a given type, as well as the number of other events between them (under certain conditions).

My data is a pandas.DataFrame that looks like this:

     Time  Var1  EvntType  Var2
0    15    1     2         17
1    19    1     1         45
2    21    6     2         43
3    23    3     2         65
4    25    0     2         76 #this one should be skipped
5    26    2     2         35
6    28    3     2         25
7    31    5     1         16
8    33    1     2         25
9    36    5     1         36
10   39    1     2         21

I want to ignore rows where Var1 equals 0, then compute the time between consecutive events of type 1 and count the events of type 2 (excluding those where Var1 == 0) that occur between them. Thus, in the above case:

Start_time: 19, Time_inbetween: 12, Event_count: 4
Start_time: 31, Time_inbetween: 5, Event_count: 1

I do it as follows:

import numpy as np

# data is the DataFrame shown above
i = 0
eventCounter = 0
lastStartTime = 0
length = data[data['EvntType']==1].shape[0]
results = np.zeros((length,3),dtype=int)
for row in data[data['Var1'] > 0].iterrows():
    myRow = row[1]
    if myRow['EvntType'] == 1:
        results[i,0] = lastStartTime
        results[i,1] = myRow['Time'] - lastStartTime
        results[i,2] = eventCounter
        lastStartTime = myRow['Time']
        eventCounter = 0
        i += 1
    else:
        eventCounter += 1

which gives me the desired result:

>>> results[1:]
array([[19, 12,  4],
       [31,  5,  1]])

But this feels like a workaround and takes a long time on large DataFrames. How can I improve this?


First, remove the rows where Var1 equals 0:

df = df.loc[df['Var1'] != 0]

Then build a boolean mask that is True where EvntType equals 1:

mask = df['EvntType']==1
# 0     False
# 1      True
# ...
# 9      True
# 10    False
# Name: EvntType, dtype: bool

Select the Time values for the rows where mask is True:

times = df.loc[mask, 'Time']
# 1    19
# 7    31
# 9    36
# Name: Time, dtype: int64

Also find the ordinal positions where mask is True:

idx = np.flatnonzero(mask)
# array([1, 6, 8])
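
Note that these are positions in the filtered frame, not index labels. A minimal sketch of why that matters for the event count, using the question's data (the names labels and positions are introduced here only for illustration):

```python
import numpy as np
import pandas as pd

# Same EvntType and Var1 columns as in the question.
df = pd.DataFrame({'EvntType': [2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2],
                   'Var1': [1, 1, 6, 3, 0, 2, 3, 5, 1, 5, 1]})
df = df.loc[df['Var1'] != 0]            # drops the row labelled 4
mask = (df['EvntType'] == 1).values     # plain boolean array

labels = df.index.values[mask]          # original index labels of type-1 rows
positions = np.flatnonzero(mask)        # ordinal positions after filtering

print(labels.tolist())                      # [1, 7, 9]
print(positions.tolist())                   # [1, 6, 8]
print((np.diff(labels) - 1).tolist())       # [5, 1]  still counts the dropped row
print((np.diff(positions) - 1).tolist())    # [4, 1]  counts only the kept rows
```

Differencing the labels would count the removed Var1 == 0 row as an event, so the positions are the ones to difference.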

start_time is every value in times except the last one, times[:-1]:

In [56]: times[:-1]
Out[56]: 
1    19
7    31
Name: Time, dtype: int64

time_inbetween is the difference between consecutive times, np.diff(times):

In [55]: np.diff(times)
Out[55]: array([12,  5])

event_count is the difference between consecutive positions in idx, minus 1:

In [57]: np.diff(idx)-1
Out[57]: array([4, 1])

Putting it all together:

import numpy as np
import pandas as pd

df = pd.DataFrame({'EvntType': [2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2],
                   'Time': [15, 19, 21, 23, 25, 26, 28, 31, 33, 36, 39],
                   'Var1': [1, 1, 6, 3, 0, 2, 3, 5, 1, 5, 1],
                   'Var2': [17, 45, 43, 65, 76, 35, 25, 16, 25, 36, 21]})

# Remove rows where Var1 equals 0
df = df.loc[df['Var1'] != 0]

mask = df['EvntType']==1
times = df.loc[mask, 'Time']
idx = np.flatnonzero(mask)

result = pd.DataFrame(
    {'start_time': times[:-1],
     'time_inbetween': np.diff(times),
     'event_count': np.diff(idx)-1})

print(result)

   event_count  start_time  time_inbetween
1            4          19              12
7            1          31               5
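
If this is needed in more than one place, the same steps can be wrapped in a function. The following is only a sketch: event_gaps is a hypothetical name, and it assumes the column names from the question. It passes columns= explicitly so the column order does not depend on the pandas version, and it works on plain arrays so the result gets a fresh 0..n-1 index instead of the original row labels.

```python
import numpy as np
import pandas as pd

def event_gaps(df):
    """Hypothetical helper: one row per pair of consecutive type-1 events."""
    df = df.loc[df['Var1'] != 0]              # ignore rows where Var1 == 0
    mask = (df['EvntType'] == 1).values       # True at type-1 events
    times = df['Time'].values[mask]           # times of the type-1 events
    idx = np.flatnonzero(mask)                # their positions after filtering
    return pd.DataFrame(
        {'start_time': times[:-1],
         'time_inbetween': np.diff(times),
         'event_count': np.diff(idx) - 1},
        columns=['start_time', 'time_inbetween', 'event_count'])

df = pd.DataFrame({'EvntType': [2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2],
                   'Time': [15, 19, 21, 23, 25, 26, 28, 31, 33, 36, 39],
                   'Var1': [1, 1, 6, 3, 0, 2, 3, 5, 1, 5, 1],
                   'Var2': [17, 45, 43, 65, 76, 35, 25, 16, 25, 36, 21]})

result = event_gaps(df)
print(result.values.tolist())   # [[19, 12, 4], [31, 5, 1]]
```

With fewer than two type-1 events the slices and diffs are all empty, so the function returns an empty DataFrame rather than raising.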