How to use dict for a subset of a DataFrame?

Say I have a DataFrame in which most of the columns hold categorical data.

    > data.head()
       age risk     sex smoking
    0   28   no    male      no
    1   58   no  female      no
    2   27   no    male     yes
    3   26   no    male      no
    4   29  yes  female     yes

I would like to take a subset of this data using a set of key-value pairs for these categorical variables.

 tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'} 

Therefore, I would like to have the following subset.

 data[ (data.risk == 'no') & (data.smoking == 'yes') & (data.sex == 'female')] 

What I want to do:

 data[tmp] 

What is the Python / pandas way to do this?


Minimal example:

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # Random 0/1 codes stored as categoricals, then relabelled.
    # (Assigning to .cat.categories was removed in pandas 2.0;
    # rename_categories works in old and new versions.)
    x = Series(np.random.randint(0,2,50), dtype='category')
    x = x.cat.rename_categories(['no', 'yes'])

    y = Series(np.random.randint(0,2,50), dtype='category')
    y = y.cat.rename_categories(['no', 'yes'])

    z = Series(np.random.randint(0,2,50), dtype='category')
    z = z.cat.rename_categories(['male', 'female'])

    a = Series(np.random.randint(20,60,50), dtype='category')

    data = DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})

    tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
+7
python pandas dataframe categorical-data
5 answers

I would use the .query() method for this task:

    In [103]: qry = ' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])

    In [104]: qry
    Out[104]: "sex == 'female' and risk == 'no' and smoking == 'yes'"

    In [105]: data.query(qry)
    Out[105]:
        age risk     sex smoking
    7    24   no  female     yes
    22   43   no  female     yes
    23   42   no  female     yes
    25   24   no  female     yes
    32   29   no  female     yes
    40   34   no  female     yes
    43   35   no  female     yes
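The format string above always wraps the value in quotes, so it only fits string-valued columns. A minimal sketch of one way around that, assuming the dictionary values are plain Python strings or numbers and the column names are valid identifiers; `dict_to_query` is a hypothetical helper name and the small frame is stand-in data, not the question's random data:

```python
import pandas as pd

# Small stand-in frame; the column names follow the question.
data = pd.DataFrame({
    "age": [28, 58, 27, 26, 29],
    "risk": ["no", "no", "no", "no", "yes"],
    "sex": ["male", "female", "male", "male", "female"],
    "smoking": ["no", "no", "yes", "no", "yes"],
})

def dict_to_query(criteria):
    """Build a query string; repr() quotes strings and leaves numbers bare."""
    return " and ".join(f"{col} == {val!r}" for col, val in criteria.items())

tmp = {"risk": "no", "smoking": "yes", "sex": "male"}
qry = dict_to_query(tmp)
result = data.query(qry)
print(qry)
print(result)
```

Because `repr()` is used instead of hard-coded quotes, a dictionary such as `{"age": 28, "sex": "male"}` would also produce a valid expression.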
+3

You can create a one-row lookup frame from the dictionary and then do an inner join with data, which has the same effect as query:

    from pandas import merge, DataFrame

    merge(DataFrame(tmp, index=[0]), data)
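One thing to keep in mind with the join: the merged result gets a fresh 0..n-1 index, so the original row labels are lost. A minimal sketch of the difference, using small stand-in data (an assumption, not the question's random data) and the reset_index / set_index round trip to keep the labels:

```python
import pandas as pd

# Small stand-in frame; the column names follow the question.
data = pd.DataFrame({
    "age": [28, 58, 27, 26, 29],
    "risk": ["no", "no", "no", "no", "yes"],
    "sex": ["male", "female", "male", "male", "female"],
    "smoking": ["no", "no", "yes", "no", "yes"],
})
tmp = {"risk": "no", "smoking": "no", "sex": "male"}

# One-row frame holding the wanted values.
lookup = pd.DataFrame(tmp, index=[0])

# Inner join on the shared columns; the result is re-indexed from 0.
merged = pd.merge(lookup, data)

# Carry the original index through the join as an ordinary column,
# then restore it afterwards.
kept = pd.merge(lookup, data.reset_index()).set_index("index")

print(merged)
print(kept)
```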


+3

You can use a list comprehension with concat and all:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    # rename_categories replaces the .cat.categories assignment,
    # which was removed in pandas 2.0
    x = pd.Series(np.random.randint(0,2,10), dtype='category')
    x = x.cat.rename_categories(['no', 'yes'])
    y = pd.Series(np.random.randint(0,2,10), dtype='category')
    y = y.cat.rename_categories(['no', 'yes'])
    z = pd.Series(np.random.randint(0,2,10), dtype='category')
    z = z.cat.rename_categories(['male', 'female'])

    a = pd.Series(np.random.randint(20,60,10), dtype='category')

    data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
    print (data)
      age risk     sex smoking
    0  24   no    male     yes
    1  23  yes    male     yes
    2  22   no  female      no
    3  40   no  female     yes
    4  59   no  female      no
    5  22   no    male     yes
    6  40   no  female      no
    7  27  yes    male     yes
    8  55  yes    male     yes
    9  48   no    male      no

    tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}

    # One boolean column per key, combined row-wise with all()
    mask = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)
    print (mask)
    0    False
    1    False
    2    False
    3     True
    4    False
    5    False
    6    False
    7    False
    8    False
    9    False
    dtype: bool

    df1 = data[mask]
    print (df1)
      age risk     sex smoking
    3  40   no  female     yes

Step by step, these are the (column, value) pairs and the intermediate boolean frame:

    L = [(x[0], x[1]) for x in tmp.items()]
    print (L)
    [('smoking', 'yes'), ('sex', 'female'), ('risk', 'no')]

    L = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1)
    print (L)
      smoking    sex   risk
    0    True  False   True
    1    True  False  False
    2   False   True   True
    3    True   True   True
    4   False   True   True
    5    True  False   True
    6   False   True   True
    7    True  False  False
    8    True  False  False
    9   False  False   True
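Since the question asks for something close to data[tmp], the mask can be wrapped in a small function. This is only a sketch under a few assumptions: `subset_by_dict` is a hypothetical name, the frame is stand-in data with plain string columns, and `np.logical_and.reduce` is swapped in for `concat(...).all(axis=1)` to combine the per-column masks:

```python
import numpy as np
import pandas as pd

def subset_by_dict(df, criteria):
    """Return the rows of df where every column in criteria equals its value."""
    if not criteria:
        # Nothing to filter on: return the frame unchanged.
        return df
    # One boolean Series per (column, value) pair.
    masks = [df[col].eq(val) for col, val in criteria.items()]
    # Element-wise AND across all per-column masks.
    mask = np.logical_and.reduce(masks)
    return df[mask]

# Small stand-in frame; the column names follow the question.
data = pd.DataFrame({
    "age": [24, 23, 22, 40, 59],
    "risk": ["no", "yes", "no", "no", "no"],
    "sex": ["male", "male", "female", "female", "female"],
    "smoking": ["yes", "yes", "no", "yes", "no"],
})
tmp = {"risk": "no", "smoking": "yes", "sex": "female"}

subset = subset_by_dict(data, tmp)
print(subset)
```

The original row labels are preserved because the function only applies a boolean mask to the frame.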

Timings:

len(data) = 1M.

    N = 1000000
    np.random.seed(123)
    # rename_categories replaces the .cat.categories assignment,
    # which was removed in pandas 2.0
    x = pd.Series(np.random.randint(0,2,N), dtype='category')
    x = x.cat.rename_categories(['no', 'yes'])
    y = pd.Series(np.random.randint(0,2,N), dtype='category')
    y = y.cat.rename_categories(['no', 'yes'])
    z = pd.Series(np.random.randint(0,2,N), dtype='category')
    z = z.cat.rename_categories(['male', 'female'])

    a = pd.Series(np.random.randint(20,60,N), dtype='category')

    data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
    #[1000000 rows x 4 columns]
    print (data)

    tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}

    In [133]: %timeit (data[pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)])
    10 loops, best of 3: 89.1 ms per loop

    In [134]: %timeit (data.query(' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])))
    1 loop, best of 3: 237 ms per loop

    In [135]: %timeit (pd.merge(pd.DataFrame(tmp, index =[0]), data.reset_index()).set_index('index'))
    1 loop, best of 3: 256 ms per loop
+3

You can create a boolean mask that checks these attributes row by row. Probably the best way:

    # itertuples() yields namedtuples, so columns are read by name
    df[[row.risk == 'no' and row.smoking == 'yes' and row.sex == 'female' for row in df.itertuples()]]
+2

I think you can use the to_dict method on your DataFrame and then filter with a list comprehension:

    df = pd.DataFrame(data={'age':[28, 29], 'sex':["M", "F"], 'smoking':['y', 'n']})
    print(df)
    tmp = {'age': 28, 'smoking': 'y', 'sex': 'M'}
    # Keep only the records that are equal to tmp as a whole dict
    print(pd.DataFrame([i for i in df.to_dict('records') if i == tmp]))

    >>>
       age sex smoking
    0   28   M       y
    1   29   F       n
       age sex smoking
    0   28   M       y

You can also convert tmp to a series:

    # equals() also compares the index, so put the keys in column order first
    ts = pd.Series(tmp).reindex(df.columns)
    print(pd.DataFrame([i[1] for i in df.iterrows() if i[1].equals(ts)]))
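Both variants compare whole rows, so they only match when tmp names every column. In the question, tmp leaves out age, so a per-key comparison is probably closer to what is needed. A minimal sketch with stand-in data (an assumption); note that the result gets a new index, because the records are rebuilt into a fresh frame:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [28, 29, 28],
    "sex": ["M", "F", "M"],
    "smoking": ["y", "n", "n"],
})
tmp = {"sex": "M", "smoking": "y"}  # 'age' is deliberately left out

records = df.to_dict("records")
# Keep a record when every key in tmp matches; other columns are ignored.
matches = [rec for rec in records if all(rec[k] == v for k, v in tmp.items())]
result = pd.DataFrame(matches)
print(result)
```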
0
