How to speed up Pandas sum of layered data?

I am trying to speed up the sum for several large tiered frames.

Here is an example:

df1 = mul_df(5000,30,400) # mul_df to create a big multilevel dataframe
#let df2, df3, df4 = df1, df1, df1 to minimize the memory usage, 
#they can also be mul_df(5000,30,400) 
df2, df3, df4 = df1, df1, df1

In [12]: timeit df1+df2+df3+df4
1 loops, best of 3: 993 ms per loop

I'm not satisfied with 993ms, is there a way to speed things up? Can cython improve performance? If so, how to write cython code? Thanks.

Note : mul_df()- This is a function for creating a demo multi-level frame.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst

Update:

Data on a dual-core Pentium processor with two cores T4200@2.00GHZ , 3.00GB RAM, WindowXP, Python 2.7.4, Numpy 1.7.1, Pandas 0.11.0, numexpr 2.0.1 (Anaconda 1.5.0 (32-bit))

In [1]: from pandas.core import expressions as expr
In [2]: import numexpr as ne

In [3]: df1 = mul_df(5000,30,400)
In [4]: df2, df3, df4 = df1, df1, df1

In [5]: expr.set_use_numexpr(False)
In [6]: %timeit df1+df2+df3+df4
1 loops, best of 3: 1.06 s per loop

In [7]: expr.set_use_numexpr(True)
In [8]: %timeit df1+df2+df3+df4
1 loops, best of 3: 986 ms per loop

In [9]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
1 loops, best of 3: 388 ms per loop
+2
source share
1 answer

1: ( numexpr )

In [41]: from pandas.core import expressions as expr

In [42]: expr.set_use_numexpr(False)

In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop

2: numexpr ( , numexpr)

In [44]: expr.set_use_numexpr(True)

In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop

3: numexpr

In [34]: import numexpr as ne

In [46]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop

numexpr, :

  • ( , , ,   numpy, , ((df1+df2)+df3)+df4

, pandas numexpr ops ( 0.11), . df1 + df2 , numexpr ( 2 , 1.). ( 3) ne.evaluate(...) .

, pandas 0,13 (0.12 ), pd.eval, , . ( , : https://github.com/pydata/pandas/pull/4037)

In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop

, , cython ; numexpr ( , , cython )

: Numexpr (Numexpr numpy ). dtype

+9

All Articles