Expressions with "== True" and "is True" give different results.

I have an MCVE:

    #!/usr/bin/env python3
    import pandas as pd

    df = pd.DataFrame([True, False, True])
    print("Whole DataFrame:")
    print(df)
    print("\nFiltered DataFrame:")
    print(df[df[0] == True])

The output is what I expected:

    Whole DataFrame:
           0
    0   True
    1  False
    2   True

    Filtered DataFrame:
          0
    0  True
    2  True

OK, but the PEP 8 style checker objects. It says: E712 comparison to True should be 'if cond is True:' or 'if cond:'. So I changed it to is True instead of == True, but now it fails. The output is:

    Whole DataFrame:
           0
    0   True
    1  False
    2   True

    Filtered DataFrame:
    0     True
    1    False
    2     True
    Name: 0, dtype: bool

What's happening?

+6
4 answers

The catch here is that in df[df[0] == True] you are not comparing an object to True in the ordinary sense.

As the other answers say, == is overloaded in pandas to produce a Series rather than a bool, which is what it usually returns. [] is also overloaded to interpret that Series and return the filtered result. The code is essentially equivalent to:

    series = df[0].__eq__(True)
    df.__getitem__(series)

So you do not violate PEP8 by leaving == here.
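To make the equivalence concrete, here is a minimal sketch. The names via_operators, via_dunders, via_column and mask are only illustrative. It checks that the operator form and the explicit method calls select the same rows, and that passing the boolean column directly as the mask does too, which avoids the == True comparison altogether:

```python
import pandas as pd

df = pd.DataFrame([True, False, True])

# Operator form: == builds a boolean Series, [] filters with it.
via_operators = df[df[0] == True]  # noqa: E712

# The same thing spelled out as explicit method calls.
mask = df[0].__eq__(True)
via_dunders = df.__getitem__(mask)

# The column is already boolean, so it can serve as the mask itself.
via_column = df[df[0]]

print(via_operators.equals(via_dunders))
print(via_operators.equals(via_column))
```

Using the column itself as the mask only works here because column 0 already holds booleans; for other dtypes an explicit comparison is still needed.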


Essentially, pandas gives familiar syntax unusual semantics, and that is what caused the confusion.
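As a rough illustration of the mechanism, assuming a toy class (Column is hypothetical and not part of pandas), any class can make == return something other than a bool, while is cannot be overloaded at all:

```python
class Column:
    """Hypothetical stand-in for a pandas Series; not part of pandas."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Element-wise comparison: returns a list, not a single bool.
        return [value == other for value in self.values]


col = Column([True, False, True])

elementwise = col == True  # dispatches to Column.__eq__
identity = col is True     # plain identity test, no method is called

print(elementwise)
print(identity)
```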

"Accoring to Stroustroup" (section .3.3.3), operator overloading has caused problems because of this since its invention (and he was thinking about whether to include it in C ++). Seeing even more abuse of it in C ++ , Gosling ran to the other extreme in Java, completely banning it, and this turned out to be just that extreme.

In conclusion, modern languages and code tend to use operator overloading, but take care not to abuse it and to keep the semantics consistent.

+2

In Python, is checks whether two names refer to the same object. == is defined by pandas.Series to act element-wise; is is not.

Because of this, df[0] is True checks whether df[0] and True are the same object. The result is False, which in turn equals 0, so you get the column labelled 0 when you execute df[df[0] is True].
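A short sketch of the two comparisons side by side (the variable names are illustrative). It deliberately stops short of indexing the frame with False, since how a DataFrame resolves that key is a pandas implementation detail:

```python
import pandas as pd

df = pd.DataFrame([True, False, True])

elementwise = df[0] == True  # overloaded: a boolean Series  # noqa: E712
identity = df[0] is True     # identity test: a plain bool

print(type(elementwise).__name__)  # Series
print(identity)                    # False

# bool is a subclass of int, so False and 0 compare and hash alike.
print(False == 0, hash(False) == hash(0))
```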

+5

I think in pandas the comparison only works with ==, and the result is a boolean Series. With is, the result is simply False.

    >>> print(df[0] == True)
    0     True
    1    False
    2     True
    Name: 0, dtype: bool
    >>> print(df[df[0]])
          0
    0  True
    2  True
    >>> print(df[df[0] == True])
          0
    0  True
    2  True
    >>> print(df[0] is True)
    False
    >>> print(df[df[0] is True])
    0     True
    1    False
    2     True
    Name: 0, dtype: bool

+2

This is an elaboration on MaxNoe's answer, as it was too long to include in the comments.

As he pointed out, df[0] is True evaluates to False, which is then coerced to 0, which matches the column name. What is interesting is what happens if you run:

    >>> df = pd.DataFrame([True, False, True])
    >>> df[False]
    KeyError                                  Traceback (most recent call last)
    <ipython-input-21-62b48754461f> in <module>()
    ----> 1 df[False]
    >>> df[0]
    0     True
    1    False
    2     True
    Name: 0, dtype: bool
    >>> df[False]
    0     True
    1    False
    2     True
    Name: 0, dtype: bool

This seems a little perplexing at first (at least to me), but it has to do with how pandas uses caching. If you look at how df[False] is resolved, it looks like this:

    /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1975)__getitem__()
    -> return self._getitem_column(key)
    /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1999)_getitem_column()
    -> return self._get_item_cache(key)
    > /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/generic.py(1343)_get_item_cache()
    -> res = cache.get(item)

Since cache is just a regular Python dict, after running df[0] the cache looks like this:

    >>> cache
    {0: 0     True
    1    False
    2     True
    Name: 0, dtype: bool}

so when looking up False, Python treats it as the key 0, because False == 0 and both hash to the same value. If we have not already populated the cache using df[0], then res is None, which leads to the KeyError at line 1345 of generic.py:

             def _get_item_cache(self, item):
    1341         """Return the cached item, item represents a label indexer."""
    1342         cache = self._item_cache
    1343  ->     res = cache.get(item)
    1344         if res is None:
    1345             values = self._data.get(item)
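The dictionary behaviour behind this can be reproduced without pandas. Below is a minimal sketch, assuming a plain dict as a stand-in for the item cache (the string value and the names miss and hit are made up for illustration):

```python
# A plain dict standing in for the item cache described above
# (hypothetical; this is not pandas' actual cache object).
cache = {}

miss = cache.get(False)  # nothing cached yet, so get() returns None
cache[0] = "column 0"    # simulates what running df[0] stores
hit = cache.get(False)   # False == 0 and hash(False) == hash(0)

print(miss)
print(hit)
```

The first lookup corresponds to the res is None branch in the listing, and the second to the case where df[0] has already been run.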
+2
