Is there a way to speed up the following pandas for loop?

My DataFrame has 10,000,000 rows. After grouping, the loop iterates over roughly 9,000,000 sub-frames.

The code:

    data = pd.read_csv('big.csv')
    for id, new_df in data.groupby(level=0):
        # look at mini df and do some analysis
        # some code for each of the small data frames

This is super inefficient and the code has been running for 10 hours.

Is there any way to speed it up?

Full code:

    d = pd.DataFrame()  # new df to populate
    print 'Start of the loop'
    for id, new_df in data.groupby(level=0):
        c = [new_df.iloc[i:] for i in range(len(new_df.index))]
        x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
        x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
        d = pd.concat([d, x])

To get the data:

 data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date']) 

Note:

Most ids will have only 1 date, which indicates a single visit. For ids with many visits, I would like to structure them in a 3D format, i.e. store all of their visits along the second of the 3 dimensions. The desired output shape is (id, visits, features).
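Not part of the question, but as a rough illustration of the target layout: a minimal sketch, assuming zero-padding for ids with fewer visits and made-up column names, of how an (id, date)-indexed frame could be turned into an (id, visits, features) array:

    import numpy as np
    import pandas as pd

    # toy frame indexed by (id, date), mirroring the structure described above
    toy = pd.DataFrame({'id': [1, 1, 2],
                        'date': [20180310, 20180311, 20180312],
                        'f0': [0.1, 0.2, 0.3],
                        'f1': [1.0, 2.0, 3.0]}).set_index(['id', 'date'])

    blocks = [g.values for _, g in toy.groupby(level=0)]   # one 2D (visits x features) block per id
    max_visits = max(len(b) for b in blocks)               # longest visit history

    # pad each block with zero rows so every id has the same number of visits, then stack
    cube = np.stack([np.vstack([b, np.zeros((max_visits - len(b), b.shape[1]))])
                     for b in blocks])
    print(cube.shape)                                      # (n_ids, max_visits, n_features) == (2, 2, 2)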

python numpy pandas
3 answers

Here is one way to speed it up. It replaces the per-group work with code that processes the rows directly, which saves the overhead of constantly constructing small DataFrames. A 100,000-row sample runs in a couple of seconds on my machine, while the original code takes > 100 seconds on just 10,000 rows of the sample data. That appears to be an improvement of several orders of magnitude.

The code:

    def make_3d(csv_filename):

        def make_3d_lines(a_df):
            a_df['depth'] = 0
            depth = 0
            prev = None
            accum = []
            for row in a_df.values.tolist():
                row[0] = 0
                key = row[1]
                if key == prev:
                    depth += 1
                    accum.append(row)
                else:
                    if depth == 0:
                        yield row
                    else:
                        depth = 0
                        to_emit = []
                        for i in range(len(accum)):
                            date = accum[i][2]
                            for j, r in enumerate(accum[i:]):
                                to_emit.append(list(r))
                                to_emit[-1][0] = j
                                to_emit[-1][2] = date
                        for r in to_emit[1:]:
                            yield r
                    accum = [row]
                prev = key

        df_data = pd.read_csv(csv_filename)
        df_data.columns = ['depth'] + list(df_data.columns)[1:]

        new_df = pd.DataFrame(
            make_3d_lines(df_data.sort_values('id date'.split())),
            columns=df_data.columns
        ).astype(dtype=df_data.dtypes.to_dict())

        return new_df.set_index('id date'.split())

Test code:

    import time

    start_time = time.time()
    df = make_3d('big-data.csv')
    print(time.time() - start_time)

    df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
    print(df[df['depth'] != 0].head(10))

Results:

    1.7390995025634766

                              depth  feature0  feature1  feature2
    id              date
    207555809644681 20180104      1   0.03125  0.038623  0.008130
    247833985674646 20180106      1   0.03125  0.004378  0.004065
    252945024181083 20180107      1   0.03125  0.062836  0.065041
                    20180107      2   0.00000  0.001870  0.008130
                    20180109      1   0.00000  0.001870  0.008130
    329567241731951 20180117      1   0.00000  0.041952  0.004065
                    20180117      2   0.03125  0.003101  0.004065
                    20180117      3   0.00000  0.030780  0.004065
                    20180118      1   0.03125  0.003101  0.004065
                    20180118      2   0.00000  0.030780  0.004065

I believe your approach to engineering these features could be better, but I will stick to answering your question.

In Python, iterating over a dictionary is much faster than iterating over a DataFrame.

Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):

    # reset the DataFrame index to get level 0 back as a column in your dataset
    df = data.reset_index()  # the index will be (id, date)

    # split the DataFrame based on id
    # and store the splits as DataFrames in a dictionary using id as key
    d = dict(tuple(df.groupby('id')))

    # iterate over the dictionary and process the values
    for key, value in d.items():
        pass  # each value is a DataFrame

    # concat the values and get the original (processed) DataFrame back
    df2 = pd.concat(d.values(), ignore_index=True)
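As a rough way to see the difference this answer describes, here is a small benchmark of my own (the sizes and column names are made up, not from the original answer) that times one pass over the groupby object against one pass over the already-built dictionary:

    import time
    import numpy as np
    import pandas as pd

    # synthetic data: 100000 ids with 3 rows each
    df = pd.DataFrame({'id': np.repeat(np.arange(100000), 3),
                       'feature': np.random.rand(300000)})

    d = dict(tuple(df.groupby('id')))      # split once, up front

    t0 = time.time()
    for _id, g in df.groupby('id'):        # builds each small sub-frame on the fly
        pass
    t1 = time.time()
    for _id, g in d.items():               # just walks the prepared dictionary
        pass
    t2 = time.time()

    print('groupby iteration: %.2fs  dict iteration: %.2fs' % (t1 - t0, t2 - t1))

Note that building the dictionary itself costs roughly one groupby pass, so the split pays off mainly when the groups are visited more than once or when the per-group processing dominates.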

Modified version of @Stephen's code

    def make_3d(dataset):

        def make_3d_lines(a_df):
            a_df['depth'] = 0                       # sets all depth from (1 to n) to 0
            depth = 1                               # initiate from 1, so that the first loop is correct
            prev = None
            accum = []                              # accumulates blocks of data belonging to a given user
            for row in a_df.values.tolist():        # for each row in our dataset
                row[0] = 0                          # NOT SURE
                key = row[1]                        # this is the id of the row
                if key == prev:                     # if this row's id matches the previous row's id, append together
                    depth += 1
                    accum.append(row)
                else:                               # else this id is new, previous block is completed -> process it
                    if depth == 0:                  # previous id appeared only once -> get that row from accum
                        yield accum[0]              # also remember that depth = 0
                    else:                           # process the block and emit each row
                        depth = 0
                        to_emit = []                # prepare to emit the list
                        for i in range(len(accum)): # for each unique day in the accumulated list
                            date = accum[i][2]      # define date to be the first date it sees
                            for j, r in enumerate(accum[i:]):
                                to_emit.append(list(r))
                                to_emit[-1][0] = j      # define the depth
                                to_emit[-1][2] = date   # define the date
                        for r in to_emit[0:]:
                            yield r
                    accum = [row]
                prev = key

        df_data = dataset.reset_index()
        df_data.columns = ['depth'] + list(df_data.columns)[1:]

        new_df = pd.DataFrame(
            make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True, False])),
            columns=df_data.columns
        ).astype(dtype=df_data.dtypes.to_dict())

        return new_df.set_index('id date'.split())

Testing:

    t = pd.DataFrame(data={'id': [1,1,1,1,2,2,3,3,4,5],
                           'date': [20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20170501,20180304],
                           'feature': [10,20,45,1,14,15,20,20,13,11],
                           'result': [1,1,0,0,0,0,1,0,1,1]})
    t = t.reindex(columns=['id','date','feature','result'])
    print t

       id      date  feature  result
    0   1  20180311       10       1
    1   1  20180310       20       1
    2   1  20180210       45       0
    3   1  20170505        1       0
    4   2  20180312       14       0
    5   2  20180311       15       0
    6   3  20180312       20       1
    7   3  20180311       20       0
    8   4  20170501       13       1
    9   5  20180304       11       1

Output:

                 depth  feature  result
    id date
    1  20180311      0       10       1
       20180311      1       20       1
       20180311      2       45       0
       20180311      3        1       0
       20180310      0       20       1
       20180310      1       45       0
       20180310      2        1       0
       20180210      0       45       0
       20180210      1        1       0
       20170505      0        1       0
    2  20180312      0       14       0
       20180312      1       15       0
       20180311      0       15       0
    3  20180312      0       20       1
       20180312      1       20       0
       20180311      0       20       0
    4  20170501      0       13       1
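For completeness, the output above was presumably produced by passing the test frame straight into the modified function, e.g. (my reconstruction; the call itself is not shown in the answer):

    print make_3d(t)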
