How to speed up row selection by column value for large Pandas dataframe

I have a large numeric Pandas dataframe `df`, and I want to select the rows whose value in a specific column `col_name` lies between `min_value` and `max_value`.

The straightforward way is:

filtered_df = df[(df[col_name].values >= min_value) & (df[col_name].values <= max_value)]
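For reference, the same inclusive-range filter can also be written with `Series.between`, which is equivalent (both bounds inclusive by default) though not necessarily faster. A minimal sketch with a made-up column name `col` and bounds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.normal(size=1000)})

# Equivalent to (df['col'] >= -1.0) & (df['col'] <= 1.0);
# between() is inclusive on both ends by default.
filtered_df = df[df['col'].between(-1.0, 1.0)]
```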

I'm looking for ways to speed this up, so I tried the following:

df.sort(col_name, inplace=True)
left_idx = np.searchsorted(df[col_name].values, min_value, side='left')
right_idx = np.searchsorted(df[col_name].values, max_value, side='right')
filtered_df = df[left_idx:right_idx]

But this does not help, because `df.sort()` itself costs more time than the original filter.
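The sort only pays off when it is amortized over many range queries on the same column. A sketch of that pattern, assuming a recent pandas (`sort_values` instead of the old `df.sort`) and a hypothetical column name `col`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.normal(size=100000)})

# Pay the O(n log n) sort cost once...
df_sorted = df.sort_values('col', ignore_index=True)
vals = df_sorted['col'].values

def range_query(lo, hi):
    # ...then each query is two O(log n) binary searches plus a cheap slice.
    left = np.searchsorted(vals, lo, side='left')
    right = np.searchsorted(vals, hi, side='right')
    return df_sorted.iloc[left:right]

subset = range_query(-1.0, 1.0)
```

For a single one-off query, the plain boolean mask remains the faster choice.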

So, any tips for speeding up the selection?

(Pandas 0.11)

1 answer

I think it's best to use numexpr to speed it up.

import pandas as pd
import numpy as np
import numexpr as ne

data = np.random.normal(size=100000000)
df = pd.DataFrame(data=data, columns=['col'])
a = df['col']
min_val = a.min()
max_val = a.max()
expr = '(a >= min_val) & (a <= max_val)'

And the timings ...

%timeit eval(expr)
1 loops, best of 3: 668 ms per loop

%timeit ne.evaluate(expr)
1 loops, best of 3: 197 ms per loop
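To get the filtered frame rather than just the timing, the boolean mask from `numexpr` can be fed straight back into the indexer. A sketch, assuming numexpr is installed; `a`, `min_val`, and `max_val` are looked up from the local scope by `ne.evaluate`:

```python
import numexpr as ne
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.normal(size=100000)})

a = df['col'].values
min_val, max_val = -1.0, 1.0

# numexpr compiles the expression and evaluates it in one multi-threaded
# pass, avoiding the intermediate temporary arrays NumPy would allocate.
mask = ne.evaluate('(a >= min_val) & (a <= max_val)')
filtered_df = df[mask]
```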
