Custom Boolean filtering in Pandas?

Question

I have a DataFrame:

              0         1         2         3         4 Marketcap
    0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
    1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
    2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851      1.1M
    3 -0.630724  0.250737  1.308556 -1.040799  1.064456    30.92M
    4  2.029370  0.899612  0.261146  1.474148 -1.663970   476.74k
    5  2.029370  0.899612  0.261146  1.474148 -1.663970        -1

Is there some kind of custom filtering method that would let Python know that B > M > K?

Say I want to filter with something like df[df.Marketcap > 35.00M]. Is there a smart or clean way to do this? The M or B suffix makes the values very readable and easy to distinguish.

Thanks.

EDIT: I reopened this thread because MaxU's answer, while excellent, seems to trigger a pandas error, and we have opened an issue on GitHub.

+3

pandas filtering

Moondra May 08 '17 at 1:40

3 answers

This is not super clean, but it does the trick and doesn't use any iteration in Python:

The code:

    # Create a separate column (which you can drop later) that converts
    # the 'Marketcap' strings to numbers (in millions)
    df['cap'] = df.loc[df['Marketcap'].str.contains('B'), 'Marketcap'].str.replace('B', '').astype(float) * 1000
    # Assign the result back instead of using inplace=True on a column selection
    df['cap'] = df['cap'].fillna(df.loc[df['Marketcap'].str.contains('M'), 'Marketcap'].str.replace('M', '').astype(float))

    # For pandas pre-0.20.0 (<May 2017); .ix was removed in pandas 1.0
    print(df.ix[df['cap'].astype(float) > 35, :-1])

    # For pandas 0.20.0+ (.ix[] deprecated)
    print(df.iloc[df[df['cap'].astype(float) > 35].index, :-1])

    # Or, alternate pandas 0.20.0+ option (thanks @Psidom)
    print(df[df['cap'].astype(float) > 35].iloc[:, :-1])

Output:

              0         1         2         3         4 Marketcap
    0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
    1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
    4  2.029370  0.899612  0.261146  1.474148 -1.663970    100.9M
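The answer above only handles the B and M suffixes. As a rough sketch of how the same "helper column in millions" idea might be extended to K as well, the snippet below splits each string into a number and a suffix with `str.extract` and looks the suffix up in a dictionary. The sample data, the `MULTIPLIERS` name, and the regular expression are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample mirroring the question's 'Marketcap' column.
df = pd.DataFrame({"Marketcap": ["1.71B", "1.82B", "1.1M", "30.92M", "476.74k", "-1"]})

# Multipliers expressed in millions, matching the answer's "B -> * 1000" convention.
MULTIPLIERS = {"K": 1e-3, "M": 1.0, "B": 1e3}

# Split each string into a numeric part and an optional one-letter suffix.
parts = df["Marketcap"].str.extract(
    r"^(?P<num>-?\d+(?:\.\d+)?)\s*(?P<suffix>[KkMmBb])?$"
)

# Values without a suffix (such as the -1 placeholder) get no multiplier and become NaN.
cap_millions = parts["num"].astype(float) * parts["suffix"].str.upper().map(MULTIPLIERS)

# NaN compares as False, so unsuffixed rows are simply left out of the result.
result = df[cap_millions > 35]
print(result)
```

Whether unsuffixed values such as -1 should be dropped, kept, or treated as missing depends on what they mean in the data; here they are left as NaN so that the choice stays explicit.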

+3

pshep123 May 08 '17 at 2:00

UPDATE:

    In [44]: df
    Out[44]:
              0         1         2         3         4 Marketcap
    0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
    1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
    2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851      1.1M
    3 -0.630724  0.250737  1.308556 -1.040799  1.064456    30.92M
    4  2.029370  0.899612  0.261146  1.474148 -1.663970   476.74k
    5  2.029370  0.899612  0.261146  1.474148 -1.663970        -1

    In [45]: df[pd.eval(df.Marketcap.replace(['[Kk]','[Mm]','[Bb]'], ['*10**3','*10**6','*10**9'], regex=True) \
                          .add(' < 35*10**6'))]
    Out[45]:
              0         1         2         3         4 Marketcap
    2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851      1.1M
    3 -0.630724  0.250737  1.308556 -1.040799  1.064456    30.92M
    4  2.029370  0.899612  0.261146  1.474148 -1.663970   476.74k
    5  2.029370  0.899612  0.261146  1.474148 -1.663970        -1

I would do it like this:

    In [13]: df[pd.eval(df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35'))]
    Out[13]:
              0         1         2         3         4 Marketcap
    0  1.707280  0.666952  0.638515 -0.061126  2.291747     1.71B
    1 -1.017134  1.353627  0.618433  0.008279  0.148128     1.82B
    4  2.029370  0.899612  0.261146  1.474148 -1.663970    100.9M

Explanation:

    In [14]: df.Marketcap.replace(['M','B'],['','*1000'], regex=True)
    Out[14]:
    0    1.71*1000
    1    1.82*1000
    2          1.1
    3        30.92
    4        100.9
    Name: Marketcap, dtype: object

    In [15]: df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35')
    Out[15]:
    0    1.71*1000 > 35
    1    1.82*1000 > 35
    2          1.1 > 35
    3        30.92 > 35
    4        100.9 > 35
    Name: Marketcap, dtype: object

    In [16]: pd.eval(df.Marketcap.replace(['M','B'],['','*1000'], regex=True).add(' > 35'))
    Out[16]: array([True, True, False, False, True], dtype=object)
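A variation on the same string-substitution trick, sketched here as one possible alternative rather than as the answerer's method: instead of building expressions for `pd.eval`, each suffix can be swapped for a power-of-ten exponent so that `pd.to_numeric` parses the result as scientific notation. The sample Series and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the question's 'Marketcap' column.
caps = pd.Series(["1.71B", "1.82B", "1.1M", "30.92M", "476.74k", "-1"], name="Marketcap")

# Swap each trailing suffix for the matching exponent, e.g. "1.71B" -> "1.71e9".
as_sci = caps.replace([r"[Kk]$", r"[Mm]$", r"[Bb]$"], ["e3", "e6", "e9"], regex=True)

# pd.to_numeric understands scientific notation; unparseable entries become NaN.
numeric = pd.to_numeric(as_sci, errors="coerce")

# A plain numeric comparison now gives the boolean mask.
mask = numeric < 35e6
print(caps[mask])
```

Because no expression strings are evaluated, this avoids `pd.eval` entirely; the trade-off is that it only covers suffixes that map cleanly onto an exponent.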

+2

Maxu May 08 '17 at 10:16

Maxu · Accepted Answer · 2017-05-09T19:25:32+0000

Source DF:

    In [176]: df
    Out[176]:
              0         1         2         3         4  Market Cap
    0  1.707280  0.666952  0.638515 -0.061126  2.291747       1.71B
    1 -1.017134  1.353627  0.618433  0.008279  0.148128       1.82B
    2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851        1.1M
    3 -0.630724  0.250737  1.308556 -1.040799  1.064456      30.92M
    4  2.029370  0.899612  0.261146  1.474148 -1.663970     476.74k
    5  2.029370  0.899612  0.261146  1.474148 -1.663970          -1

Solution:

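    # Raw strings keep the regex escapes intact
    to_replace = [r'\d+\s*[Kk]', r'\d+\s*[Mm]', r'\d+\s*[Bb]', '-1', 'N/A']
    value = [1000, 1000000, 1000000000, 1, 1]

    mask = df.assign(
        # f: the multiplier implied by the suffix
        f=df['Market Cap'].replace(to_replace, value, regex=True),
        # Marketcap: the numeric part only (regex=True is needed on pandas 2.0+,
        # where str.replace defaults to literal matching)
        Marketcap=pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', '', regex=True),
                                errors='coerce')
    ).eval("Marketcap * f < 35000000")

    df[mask]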

Result:

    In [178]: df[mask]
    Out[178]:
              0         1         2         3         4  Market Cap
    2 -0.774057 -0.165566 -0.083345  0.741598 -0.139851        1.1M
    3 -0.630724  0.250737  1.308556 -1.040799  1.064456      30.92M
    4  2.029370  0.899612  0.261146  1.474148 -1.663970     476.74k
    5  2.029370  0.899612  0.261146  1.474148 -1.663970          -1

P.S. If you want to keep rows with non-numeric values (for example, N/A) in the resulting data set, change:

    pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', '', regex=True), errors='coerce')

to

    pd.to_numeric(df['Market Cap'].str.replace(r'[^\d\.]', '', regex=True), errors='coerce').fillna(0)
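To get closer to the `df.Marketcap > 35.00M` spelling asked for in the question, one option is a small helper that parses both the column and the threshold with the same rules. The sketch below is only an illustration: `parse_cap`, `SUFFIXES`, and the sample data are hypothetical names and values, and unlike the approaches above it calls a Python function once per row through `Series.map`.

```python
import pandas as pd

# Hypothetical suffix table following the K/M/B convention from the question.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_cap(text):
    """Convert a string such as '35.00M' to a float; return NaN if it cannot be parsed."""
    text = str(text).strip()
    factor = SUFFIXES.get(text[-1:].upper())
    number = text[:-1] if factor is not None else text
    try:
        return float(number) * (factor if factor is not None else 1.0)
    except ValueError:
        return float("nan")

# Hypothetical sample mirroring the 'Market Cap' column, plus an N/A entry.
df = pd.DataFrame({"Market Cap": ["1.71B", "1.82B", "1.1M", "30.92M", "476.74k", "-1", "N/A"]})
caps = df["Market Cap"].map(parse_cap)

# The threshold is written in the same readable notation as the data.
small = df[caps < parse_cap("35.00M")]
print(small)
```

In this sketch the -1 placeholder is parsed as the number -1 and N/A becomes NaN, so the first passes a "less than" filter and the second never matches; adjust that handling to whatever those markers mean in the real data.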
