How to use dict for a subset of a DataFrame?

Question

Say I have a DataFrame in which most of the columns are categorical data.

    > data.head()
       age risk     sex smoking
    0   28   no    male      no
    1   58   no  female      no
    2   27   no    male     yes
    3   26   no    male      no
    4   29  yes  female     yes

I would like to select a subset of this data using a dict of key-value pairs for these categorical variables.

 tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}

Therefore, I would like to have the following subset.

 data[ (data.risk == 'no') & (data.smoking == 'yes') & (data.sex == 'female')]

What I want to do:

 data[tmp]

What is the idiomatic Python / pandas way to do this?

Minimal example:

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # rename_categories is used because recent pandas versions
    # no longer allow assigning to .cat.categories directly
    x = Series(np.random.randint(0, 2, 50), dtype='category').cat.rename_categories(['no', 'yes'])
    y = Series(np.random.randint(0, 2, 50), dtype='category').cat.rename_categories(['no', 'yes'])
    z = Series(np.random.randint(0, 2, 50), dtype='category').cat.rename_categories(['male', 'female'])
    a = Series(np.random.randint(20, 60, 50), dtype='category')

    data = DataFrame({'risk': x, 'smoking': y, 'sex': z, 'age': a})

    tmp = {'risk': 'no', 'smoking': 'yes', 'sex': 'female'}

+7

python pandas dataframe categorical-data

Thomas Möbius Oct 18 '16 at 15:01

5 answers

You can create a lookup frame from the dictionary and then do an inner join with data, which has the same effect as query:

    from pandas import merge, DataFrame
    merge(DataFrame(tmp, index=[0]), data)
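
One thing to watch with the merge approach is that the result gets a fresh 0..n-1 index, so the original row labels are lost. Below is a minimal sketch of one way to keep them, assuming a small made-up frame with plain string columns. The names `lookup`, `plain`, and `kept` are illustrative and not part of the answer.

```python
import pandas as pd

# Small illustrative frame (made-up values, plain string columns).
data = pd.DataFrame({
    "age": [28, 58, 27, 26, 29],
    "risk": ["no", "no", "no", "no", "yes"],
    "sex": ["male", "female", "male", "male", "female"],
    "smoking": ["no", "no", "yes", "no", "yes"],
})
tmp = {"risk": "no", "smoking": "no", "sex": "male"}

# One-row frame holding the wanted values.
lookup = pd.DataFrame(tmp, index=[0])

# Plain inner join on the shared columns: the result is re-indexed from 0.
plain = pd.merge(lookup, data)

# Move the original index into a column before merging, then restore it.
kept = data.reset_index().merge(lookup, on=list(tmp)).set_index("index")

print(plain)
print(kept)
```

Here `kept` carries the original row labels, which matters if the subset is later used to index back into `data`.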

+3

Psidom Oct 18 '16 at 15:31

You can use a list comprehension with concat and all:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    # rename_categories is used because recent pandas versions
    # no longer allow assigning to .cat.categories directly
    x = pd.Series(np.random.randint(0,2,10), dtype='category').cat.rename_categories(['no', 'yes'])
    y = pd.Series(np.random.randint(0,2,10), dtype='category').cat.rename_categories(['no', 'yes'])
    z = pd.Series(np.random.randint(0,2,10), dtype='category').cat.rename_categories(['male', 'female'])
    a = pd.Series(np.random.randint(20,60,10), dtype='category')

    data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
    print (data)
       age risk     sex smoking
    0   24   no    male     yes
    1   23  yes    male     yes
    2   22   no  female      no
    3   40   no  female     yes
    4   59   no  female      no
    5   22   no    male     yes
    6   40   no  female      no
    7   27  yes    male     yes
    8   55  yes    male     yes
    9   48   no    male      no

    tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
    mask = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)
    print (mask)
    0    False
    1    False
    2    False
    3     True
    4    False
    5    False
    6    False
    7    False
    8    False
    9    False
    dtype: bool

    df1 = data[mask]
    print (df1)
       age risk     sex smoking
    3   40   no  female     yes

    L = [(x[0], x[1]) for x in tmp.items()]
    print (L)
    [('smoking', 'yes'), ('sex', 'female'), ('risk', 'no')]

    L = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1)
    print (L)
      smoking    sex   risk
    0    True  False   True
    1    True  False  False
    2   False   True   True
    3    True   True   True
    4   False   True   True
    5    True  False   True
    6   False   True   True
    7    True  False  False
    8    True  False  False
    9   False  False   True

Timings:

len(data) = 1M.

    N = 1000000
    np.random.seed(123)
    # rename_categories is used because recent pandas versions
    # no longer allow assigning to .cat.categories directly
    x = pd.Series(np.random.randint(0,2,N), dtype='category').cat.rename_categories(['no', 'yes'])
    y = pd.Series(np.random.randint(0,2,N), dtype='category').cat.rename_categories(['no', 'yes'])
    z = pd.Series(np.random.randint(0,2,N), dtype='category').cat.rename_categories(['male', 'female'])
    a = pd.Series(np.random.randint(20,60,N), dtype='category')

    data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
    #[1000000 rows x 4 columns]
    print (data)

    tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}

    In [133]: %timeit (data[pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)])
    10 loops, best of 3: 89.1 ms per loop

    In [134]: %timeit (data.query(' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])))
    1 loop, best of 3: 237 ms per loop

    In [135]: %timeit (pd.merge(pd.DataFrame(tmp, index =[0]), data.reset_index()).set_index('index'))
    1 loop, best of 3: 256 ms per loop
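
The mask-building step can also be wrapped in a small reusable function. The sketch below is one possible variant, not the code from this answer: it combines the per-column comparisons with `np.logical_and.reduce` instead of `concat(...).all(axis=1)`, and the helper name `filter_by_dict` is hypothetical. It uses the ten rows printed earlier in this answer, stored as plain strings rather than categoricals.

```python
import numpy as np
import pandas as pd


def filter_by_dict(frame, criteria):
    """Return the rows of `frame` where every column named in `criteria` equals its value."""
    if not criteria:
        return frame  # nothing to filter on
    # One boolean Series per key-value pair, combined with an element-wise AND.
    masks = [frame[col] == value for col, value in criteria.items()]
    return frame[np.logical_and.reduce(masks)]


# The ten rows shown in the answer's printed output, as plain strings.
data = pd.DataFrame({
    "age": [24, 23, 22, 40, 59, 22, 40, 27, 55, 48],
    "risk": ["no", "yes", "no", "no", "no", "no", "no", "yes", "yes", "no"],
    "sex": ["male", "male", "female", "female", "female",
            "male", "female", "male", "male", "male"],
    "smoking": ["yes", "yes", "no", "yes", "no", "yes", "no", "yes", "yes", "no"],
})
tmp = {"risk": "no", "smoking": "yes", "sex": "female"}

subset = filter_by_dict(data, tmp)
print(subset)
```

The empty-dict guard is a design choice made here: with no criteria the frame is returned unchanged rather than raising.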

+3

jezrael Oct 19 '16 at 8:15

You can build a boolean vector that checks these attributes row by row. One way to do it:

    # itertuples() yields one namedtuple per row; the list of booleans is used as a mask
    data[[row.risk == 'no' and row.smoking == 'yes' and row.sex == 'female' for row in data.itertuples()]]

+2

Patrick haugh Oct 18 '16 at 15:17

I think you can use the to_dict method on your DataFrame and then filter using a list comprehension:

    df = pd.DataFrame(data={'age':[28, 29], 'sex':["M", "F"], 'smoking':['y', 'n']})
    print(df)
    tmp = {'age': 28, 'smoking': 'y', 'sex': 'M'}
    # each record is a dict, so it can be compared with tmp directly
    print(pd.DataFrame([i for i in df.to_dict('records') if i == tmp]))

    >>>
       age sex smoking
    0   28   M       y
    1   29   F       n
       age sex smoking
    0   28   M       y

You can also convert tmp to a series:

    # reindex to the column order so equals() compares matching labels
    ts = pd.Series(tmp).reindex(df.columns)
    print(pd.DataFrame([i[1] for i in df.iterrows() if i[1].equals(ts)]))
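
Note that comparing each record with `tmp` by equality only matches when `tmp` names every column of the frame. A minimal sketch of a variant that also works for a partial dict, like the one in the question, is shown below. The frame contents and the name `partial` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [28, 29, 28],
    "sex": ["M", "F", "M"],
    "smoking": ["y", "n", "n"],
})
partial = {"sex": "M", "smoking": "y"}  # deliberately omits 'age'

# Keep a record when every key in `partial` has the wanted value.
matches = [
    rec for rec in df.to_dict("records")
    if all(rec[key] == value for key, value in partial.items())
]
result = pd.DataFrame(matches)
print(result)
```

Like the original, this rebuilds the frame from dicts, so the original index is not preserved.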

0

kezzos Oct 19 '16 at 8:45

Maxu · Accepted Answer · 2016-10-18T15:33:25+0000

I would use the .query() method for this task:

    In [103]: qry = ' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])

    In [104]: qry
    Out[104]: "sex == 'female' and risk == 'no' and smoking == 'yes'"

    In [105]: data.query(qry)
    Out[105]:
       age risk     sex smoking
    7   24   no  female     yes
    22  43   no  female     yes
    23  42   no  female     yes
    25  24   no  female     yes
    32  29   no  female     yes
    40  34   no  female     yes
    43  35   no  female     yes
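
The format string above wraps every value in single quotes, so it assumes all values in the dict are strings. A minimal sketch of a variant that lets Python's `repr` decide the quoting, so numeric values also work, is shown below. It assumes the column names are valid Python identifiers and uses a small made-up frame with plain string columns.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [28, 58, 27, 26, 29],
    "risk": ["no", "no", "no", "no", "yes"],
    "sex": ["male", "female", "male", "male", "female"],
    "smoking": ["no", "no", "yes", "no", "yes"],
})
tmp = {"risk": "no", "sex": "male", "age": 26}

# {!r} inserts repr(value): strings come out quoted, numbers come out bare.
qry = " and ".join("{} == {!r}".format(k, v) for k, v in tmp.items())
print(qry)  # risk == 'no' and sex == 'male' and age == 26

subset = data.query(qry)
print(subset)
```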
