Pandas: How to make apply on a DataFrame faster?

Question


Consider this pandas example, where I compute column C using apply with a lambda function: C is A when B is greater than 5, and otherwise A multiplied by B and a float:

    import pandas as pd

    df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9],'B':[9,8,7,6,5,4,3,2,1]})
    df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)

Expected Result:

       A  B    C
    0  1  9  1.0
    1  2  8  2.0
    2  3  7  3.0
    3  4  6  4.0
    4  5  5  2.5
    5  6  4  2.4
    6  7  3  2.1
    7  8  2  1.6
    8  9  1  0.9

The problem is that this code is slow and I need to do this operation on a data frame with approximately 56 million rows.

The %timeit result for the lambda operation above is:

 1000 loops, best of 3: 1.63 ms per loop

Based on the computation time, as well as the memory usage when doing this on my large data frame, I assume that this operation creates intermediate Series while doing the calculations.

I tried to formulate it in different ways, including using temporary columns, but every alternative solution I came up with is even slower.

Is there a way to get the result that I need in a different and faster way, for example using numpy?

+7

python python-2.7 numpy pandas apply

Khris Jan 11 '17 at 10:14


4 answers

pure pandas
using pd.Series.where

    df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))

       A  B    C
    0  1  9  1.0
    1  2  8  2.0
    2  3  7  3.0
    3  4  6  4.0
    4  5  5  2.5
    5  6  4  2.4
    6  7  3  2.1
    7  8  2  1.6
    8  9  1  0.9
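One detail that is easy to trip over: Series.where keeps the caller's values where the condition is True and takes the second argument elsewhere, whereas np.where takes the condition first and then both branches. The following is a minimal sketch of that contrast on the question's data; the names cond, other, via_pandas and via_numpy are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

cond = df['B'] > 5                 # True where A should be kept unchanged
other = 0.1 * df['A'] * df['B']    # value to use where cond is False

# pandas: caller.where(cond, other) -> caller where cond holds, else other
via_pandas = df['A'].where(cond, other)

# NumPy: np.where(cond, x, y) -> x where cond holds, else y
via_numpy = np.where(cond, df['A'], other)

print(via_pandas.values)
print(via_numpy)
```

Both calls describe the same selection; the pandas version returns a Series that keeps the index, the NumPy version returns a plain array.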

+4

piRSquared Jan 11 '17 at 10:20


Using numpy.where:

 df['C'] = numpy.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
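Before switching a large job over, it may be worth confirming on a small frame that the vectorized expression reproduces the row-wise apply. This is a sketch of such a check on the question's sample data, assuming numpy is imported as np; the names slow, fast and same are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Row-wise version from the question: one Python-level call per row.
slow = df.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

# Vectorized version: both branches are computed for every row,
# then np.where picks one of them per element.
fast = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])

# Compare as floats with a tolerance rather than with exact equality.
same = np.allclose(slow.astype(float).values, fast)
print(same)
```

Note that np.where evaluates both branches in full, so this pattern only suits expressions that are safe to compute for every row.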

+3

Ians Jan 11 '17 at 10:18


Using:

    df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    print (df)
       A  B    C
    0  1  9  1.0
    1  2  8  2.0
    2  3  7  3.0
    3  4  6  4.0
    4  5  5  2.5
    5  6  4  2.4
    6  7  3  2.1
    7  8  2  1.6
    8  9  1  0.9

+2

jezrael Jan 11 '17 at 10:18


Divakar · Accepted Answer · 2017-01-11T10:16:38+0000

For performance, you might be better off working with a NumPy array and using np.where -

    a = df.values # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
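The positional indexing above relies on A and B being the first two columns and on the frame holding a single numeric dtype. As a variation (a sketch, not the answer's own code), each column can be pulled out by name as its own array; numpy_by_name and the extra id column are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def numpy_by_name(df):
    # Extract each column as a 1-D NumPy array by label, so column
    # order and unrelated columns do not matter.
    a = df['A'].values
    b = df['B'].values
    df['C'] = np.where(b > 5, a, 0.1 * a * b)
    return df

# A frame with an extra non-numeric column in front of A and B.
df = pd.DataFrame({'id': list('abcdefghi'),
                   'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

out = numpy_by_name(df)
print(out)
```

With mixed dtypes, df.values would produce a single object array, which is why selecting the numeric columns individually can be the safer choice.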

Runtime test

    def numpy_based(df):
        a = df.values # Assuming you have two columns A and B
        df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])

Timings -

    In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])

    In [272]: %timeit numpy_based(df)
    1000 loops, best of 3: 380 µs per loop

    In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])

    In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
    100 loops, best of 3: 3.39 ms per loop

    In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])

    In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
    1000 loops, best of 3: 1.12 ms per loop

    In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])

    In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    1000 loops, best of 3: 1.19 ms per loop

Closer look

Let's take a closer look at NumPy's number-crunching performance and bring pandas into the comparison -

    # Extract out as array (it's a view, so not really expensive
    # .. as compared to the later computations themselves)
    In [291]: a = df.values

    In [296]: %timeit df.values
    10000 loops, best of 3: 107 µs per loop

Case #1: Working with a NumPy array and using numpy.where:

    In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
    10000 loops, best of 3: 86.5 µs per loop

Again, assigning the result to a new column, df['C'], is not very expensive either -

    In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
    1000 loops, best of 3: 323 µs per loop

Case #2: Working with a pandas DataFrame and using its .where method (no NumPy):

    In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
    100 loops, best of 3: 3.4 ms per loop

Case #3: Working with a pandas DataFrame (no NumPy array), but using numpy.where -

    In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
    1000 loops, best of 3: 764 µs per loop

Case #4: Working again with a pandas DataFrame (no NumPy array), but using numpy.where -

    In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    1000 loops, best of 3: 830 µs per loop
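To repeat this kind of comparison outside IPython, the standard-library timeit module can stand in for %timeit. The sketch below makes a few assumptions of its own: the seed, the flat column list ['A', 'B'], the loop counts and the candidates dictionary are illustrative choices, and the numbers it prints depend on the machine and library versions, so they will not match the timings above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so reruns use the same data
df = pd.DataFrame(rng.randint(0, 9, (10000, 2)), columns=['A', 'B'])
a = df.values

# Each candidate only computes the result; column assignment is left out.
candidates = {
    'array + np.where': lambda: np.where(
        a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1]),
    'Series.where': lambda: df.A.where(
        df.B.gt(5), df[['A', 'B']].prod(axis=1).mul(0.1)),
    'Series + np.where': lambda: np.where(
        df['B'] > 5, df['A'], 0.1 * df['A'] * df['B']),
}

loops = 20
timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats, reported per single call.
    timings[name] = min(timeit.repeat(fn, number=loops, repeat=3)) / loops

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print('%-20s %10.1f us per loop' % (name, seconds * 1e6))
```

Taking the minimum over repeats mirrors the "best of 3" reporting in the session above.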
