Pandas: Is it possible to filter a frame with arbitrarily long Boolean criteria?

Question

If you know exactly how you want to filter the data frame, the solution is trivial:

df[(df.A == 1) & (df.B == 1)]

But what if you accept user input and don't know in advance how many criteria the user wants to apply? For example, the user wants the rows where columns A, B, and C all equal 1. Is it possible to do something like:

    def filterIt(*args, value):
        return df[(df.*args == value)]

so that when the user calls filterIt(A, B, C, value=1), it returns:

 df[(df.A == 1) & (df.B == 1) & (df.C == 1)]

+7

python pandas

yobogoya Feb 09 '16 at 22:29

4 answers

Here is another approach. It is cleaner, more efficient, and has the advantage that columns can be empty (in this case, the entire data frame is returned).

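    def filter(df, value, *columns):
        # *columns arrives as a tuple; convert it to a list so that .loc
        # treats it as a list of labels rather than a single tuple key.
        return df.loc[df.loc[:, list(columns)].eq(value).all(axis=1)]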

Explanation

values = df.loc[:, list(columns)] selects only the columns of interest (columns arrives as a tuple, so it is converted to a list for label-based selection).
masks = values.eq(value) produces a Boolean data frame indicating where each cell equals the target value.
mask = masks.all(axis=1) applies AND across the columns and returns a Boolean row mask. Note that you can use masks.any(axis=1) for OR.
return df.loc[mask] applies the row mask to the data frame.
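The steps above can be traced on a small frame. This is a minimal sketch; the sample data and the names mask_and, mask_or, rows_and, and rows_or are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"A": [1, 1, 0, 1], "B": [1, 0, 0, 1], "C": [1, 1, 1, 0]})
columns = ["A", "B"]

values = df.loc[:, columns]    # keep only the columns being tested
masks = values.eq(1)           # Boolean frame: True where a cell equals 1
mask_and = masks.all(axis=1)   # True where every selected column matches
mask_or = masks.any(axis=1)    # True where at least one selected column matches

rows_and = df.loc[mask_and]    # rows with A == 1 and B == 1
rows_or = df.loc[mask_or]      # rows with A == 1 or B == 1
```

Column C is never inspected here, so it has no influence on either mask.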

Demo

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 2, (100, 3)), columns=list('ABC'))

    # both columns
    assert np.all(filter(df, 1, 'A', 'B') == df[(df.A == 1) & (df.B == 1)])

    # no columns
    assert np.all(filter(df, 1) == df)

    # different values per column
    assert np.all(filter(df, [1, 0], 'A', 'B') == df[(df.A == 1) & (df.B == 0)])

Alternative

For a small number of columns (<5), the following solution, based on steven's answer, is more efficient than the one above, although less flexible. As-is, it will not work for an empty set of columns, and it will not work with different values for each column.

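    from functools import reduce  # reduce is not a builtin in Python 3
    from operator import and_

    def filter(df, value, *columns):
        # AND together one Boolean Series per column.
        return df.loc[reduce(and_, (df[column] == value for column in columns))]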

Getting a Series by key (df[column]) is much faster than constructing a DataFrame from a subset of columns (df.loc[:, columns]).

    In [4]: %timeit df['A'] == 1
    100 loops, best of 3: 17.3 ms per loop

    In [5]: %timeit df.loc[:, ['A']] == 1
    10 loops, best of 3: 48.6 ms per loop

However, this speedup becomes negligible when working with a large number of columns. The bottleneck becomes ANDing the masks together, for which reduce(and_, ...) is much slower than the pandas built-in all(axis=1).
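The empty-columns limitation of the reduce-based version can be worked around by giving reduce a starting value. The sketch below is one possible variant, not part of the original answer; the name filter_reduce and the sample data are hypothetical.

```python
from functools import reduce
from operator import and_

import pandas as pd


def filter_reduce(df, value, *columns):
    # Hypothetical variant: start from an all-True mask so that an empty
    # `columns` leaves the frame unfiltered instead of raising TypeError.
    all_true = pd.Series(True, index=df.index)
    mask = reduce(and_, (df[column] == value for column in columns), all_true)
    return df.loc[mask]


df = pd.DataFrame({"A": [1, 1, 0], "B": [1, 0, 1]})
both = filter_reduce(df, 1, "A", "B")   # rows where A == 1 and B == 1
everything = filter_reduce(df, 1)       # no columns: all rows are kept
```

The starting mask adds one extra AND per call, so this trades a little of the speed advantage for the ability to handle an empty column set.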

+5

Igor Raush Feb 10 '16 at 0:04

It's pretty dirty, but it seems to work.

    import operator
    from functools import reduce
    import pandas as pd

    def filterIt(value, args):
        # Build one Boolean Series per column name, then AND them together.
        stuff = [getattr(b, thing) == value for thing in args]
        return reduce(operator.and_, stuff)

    a = {'A': [1, 2, 3], 'B': [2, 2, 2], 'C': [3, 2, 1]}
    b = pd.DataFrame(a)

    filterIt(2, ['A', 'B', 'C'])
    0    False
    1     True
    2    False
    dtype: bool

    (b.A == 2) & (b.B == 2) & (b.C == 2)
    0    False
    1     True
    2    False
    dtype: bool

+1

steven Feb 09 '16 at 23:01

Thanks for the help, everyone. I came up with something similar to Marius's answer after learning about df.query():

    def makeQuery(cols, equivalence=True, *args):
        operator = ' == ' if equivalence else ' != '
        query = ''
        # One parenthesized comparison per (value, column) pair, joined by ' & '.
        for arg in args:
            for col in cols:
                query = query + "({}{}{})".format(col, operator, arg) + ' & '
        return query[:-3]  # drop the trailing ' & '

    query = makeQuery(['A', 'B', 'C'], False, 1, 2)

The resulting query is the string:

 (A != 1) & (B != 1) & (C != 1) & (A != 2) & (B != 2) & (C != 2)

which can be passed to df.query(query).
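As a quick check, the generated string can be compared against the same filter written out as an explicit Boolean mask. This is a minimal sketch; the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"A": [1, 3, 2, 4], "B": [3, 3, 1, 4], "C": [4, 3, 3, 2]})

query = "(A != 1) & (B != 1) & (C != 1) & (A != 2) & (B != 2) & (C != 2)"
via_query = df.query(query)

# The same filter written as an explicit Boolean mask.
explicit = df[(df.A != 1) & (df.B != 1) & (df.C != 1)
              & (df.A != 2) & (df.B != 2) & (df.C != 2)]
```

Both forms keep only the rows in which none of A, B, or C equals 1 or 2.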

+1

yobogoya Feb 09 '16 at 23:36

Marius · Accepted Answer · 2016-02-09T23:08:49+0000

I think the most elegant way to do this is to use df.query(), where you build a string containing all your conditions, for example:

    import pandas as pd
    import numpy as np

    cols = {}
    for col in ('A', 'B', 'C', 'D', 'E'):
        cols[col] = np.random.randint(1, 5, 20)
    df = pd.DataFrame(cols)

    def filter_df(df, filter_cols, value):
        conditions = []
        for col in filter_cols:
            conditions.append('{c} == {v}'.format(c=col, v=value))
        query_expr = ' and '.join(conditions)
        print('querying with: {q}'.format(q=query_expr))
        return df.query(query_expr)

Sample output (your results may vary due to randomly generated data):

    filter_df(df, ['A', 'B'], 1)
    querying with: A == 1 and B == 1
        A  B  C  D  E
    6   1  1  1  2  1
    11  1  1  2  3  4
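Formatting the value directly into the query string works for numbers, but a string value such as x would be rendered as A == x and read as a column name. One way around this, sketched below, is to reference the local variable with the @ prefix that df.query() supports. The function name filter_df_by_value and the sample data are hypothetical and not part of the original answer.

```python
import pandas as pd


def filter_df_by_value(df, filter_cols, value):
    # Hypothetical variant of filter_df: `@value` makes query() look up the
    # local variable instead of formatting the value into the string.
    query_expr = " and ".join("{c} == @value".format(c=col) for col in filter_cols)
    return df.query(query_expr)


df = pd.DataFrame({"A": ["x", "y", "x"], "B": ["x", "x", "y"], "C": [1, 2, 3]})
matches = filter_df_by_value(df, ["A", "B"], "x")   # rows where A == "x" and B == "x"
```

The column names are still interpolated into the string, so this sketch assumes they are valid Python identifiers.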
