Python pandas: why is a map faster?

In the pandas documentation, there is an example of indexing:

    In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
    In [654]: df2[criterion]

Then Wes wrote:

    # equivalent but slower
    In [655]: df2[[x.startswith('t') for x in df2['a']]]

Can someone explain why the map approach is faster? Is this a Python function or a pandas function?

1 answer

Arguments about why a certain way of doing things in Python "should be" faster shouldn't be taken too seriously, because you are often measuring implementation details that may behave differently in different situations. As a result, when people guess what should be faster, they are often (usually?) wrong. For example, I found that map can actually be slower. Using this setup code:

    import numpy as np, pandas as pd
    import random, string

    def make_test(num, width):
        s = [''.join(random.sample(string.ascii_lowercase, width)) for i in range(num)]
        df = pd.DataFrame({"a": s})
        return df

Let's compare the time it takes to build the indexing object, whether it is a Series or a list, with the total time it takes to use that object to index into the DataFrame. It could be, for example, that building a list is fast, but before it can be used as an index it has to be converted internally to a Series or an ndarray or something else, and extra time is added there.
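To make the three kinds of indexing object concrete, here is a minimal sketch. The four-row frame is made up purely for illustration and is not part of the timings; it only shows that the map and .str versions produce a Series while the listcomp produces a plain list, and that all three select the same rows.

```python
import pandas as pd

# Tiny hypothetical frame, only for illustration (not the benchmark data)
df = pd.DataFrame({"a": ["tab", "cat", "top", "dog"]})

mask_map = df["a"].map(lambda x: x.startswith("t"))  # pandas Series of bools
mask_list = [x.startswith("t") for x in df["a"]]     # plain Python list of bools
mask_str = df["a"].str.startswith("t")               # pandas Series of bools

print(type(mask_map).__name__, type(mask_list).__name__, type(mask_str).__name__)

# All three can be used as a boolean index and pick the same rows
result_map = df[mask_map]
result_list = df[mask_list]
result_str = df[mask_str]
```

The list version is the only one pandas has to turn into an array-like mask itself at indexing time, which is the extra step described above.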

First, for a small frame:

    >>> df = make_test(10, 10)
    >>> %timeit df['a'].map(lambda x: x.startswith('t'))
    10000 loops, best of 3: 85.8 µs per loop
    >>> %timeit [x.startswith('t') for x in df['a']]
    100000 loops, best of 3: 15.6 µs per loop
    >>> %timeit df['a'].str.startswith("t")
    10000 loops, best of 3: 118 µs per loop
    >>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
    1000 loops, best of 3: 304 µs per loop
    >>> %timeit df[[x.startswith('t') for x in df['a']]]
    10000 loops, best of 3: 194 µs per loop
    >>> %timeit df[df['a'].str.startswith("t")]
    1000 loops, best of 3: 348 µs per loop

And in this case the listcomp is the fastest. To be honest, this doesn't surprise me too much, because going through a lambda is likely to be slower than calling str.startswith directly, but it's really hard to guess. 10 is small enough that we are probably still measuring things like Series setup costs; what happens in a larger frame?

    >>> df = make_test(10**5, 10)
    >>> %timeit df['a'].map(lambda x: x.startswith('t'))
    10 loops, best of 3: 46.6 ms per loop
    >>> %timeit [x.startswith('t') for x in df['a']]
    10 loops, best of 3: 27.8 ms per loop
    >>> %timeit df['a'].str.startswith("t")
    10 loops, best of 3: 48.5 ms per loop
    >>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
    10 loops, best of 3: 47.1 ms per loop
    >>> %timeit df[[x.startswith('t') for x in df['a']]]
    10 loops, best of 3: 52.8 ms per loop
    >>> %timeit df[df['a'].str.startswith("t")]
    10 loops, best of 3: 49.6 ms per loop

And now it seems that map wins when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?

    >>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
    10 loops, best of 3: 40.7 ms per loop
    >>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
    10 loops, best of 3: 37.5 ms per loop

And now the listcomp wins again!

Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you are testing what you think you are.
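If you are not in IPython, the same kind of comparison can be sketched with the standard timeit module. This is only one possible harness, not the one used for the numbers above: the frame size of 1000, the candidates dict and the repeat counts are arbitrary choices of mine. It first checks that the candidates return the same rows, so that you are at least timing equivalent things.

```python
import random
import string
import timeit

import pandas as pd


def make_test(num, width):
    # Same setup as above: num random lowercase strings of length width
    s = ["".join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    return pd.DataFrame({"a": s})


df = make_test(1000, 10)  # size chosen arbitrarily for this sketch

# Hypothetical harness: each candidate builds a mask and indexes with it
candidates = {
    "map": lambda: df[df["a"].map(lambda x: x.startswith("t"))],
    "listcomp": lambda: df[[x.startswith("t") for x in df["a"]]],
    "str.startswith": lambda: df[df["a"].str.startswith("t")],
}

# Check the candidates agree before timing them
expected = candidates["map"]()
for name, func in candidates.items():
    assert func().equals(expected), name

# Time each candidate; keep the best of several repeats, per call
number = 20
timings = {}
for name, func in candidates.items():
    best = min(timeit.repeat(func, number=number, repeat=3))
    timings[name] = best / number
    print(f"{name:>15}: {timings[name] * 1e6:.1f} µs per call")
```

Which candidate comes out ahead will depend on the frame size, the pandas version and the machine, so run it on data shaped like your own.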
