What happens when I modify a pandas DataFrame as follows

Question

I am trying to understand the behavior below: why it happens and, if it is intentional, what the motivation for it was.

So I create a dataframe

    np.random.seed(0)
    df = pd.DataFrame(np.random.random((4,2)))

              0         1
    0  0.548814  0.715189
    1  0.602763  0.544883
    2  0.423655  0.645894
    3  0.437587  0.891773

and I can refer to its columns like this

    df.columns = ['a','b']
    df.a

              0
    0  0.548814
    1  0.602763
    2  0.423655
    3  0.437587

I can even create what I think is a new column

  df.third = pd.DataFrame(np.random.random((4,1)))

but df is still

    df

              0         1
    0  0.548814  0.715189
    1  0.602763  0.544883
    2  0.423655  0.645894
    3  0.437587  0.891773

however df.third also exists (but I don't see it in my variable viewer in Spyder)

    df.third

              0
    0  0.118274
    1  0.639921
    2  0.143353
    3  0.944669

If I wanted to add a third column, I would have to do the following

    df['third'] = pd.DataFrame(np.random.random((4,1)))

              a         b     third
    0  0.548814  0.715189  0.568045
    1  0.602763  0.544883  0.925597
    2  0.423655  0.645894  0.071036
    3  0.437587  0.891773  0.087129

So my question is: what happens when I do df.third versus df['third']?

python pandas

Mohammad athar Mar 23 '17 at 14:17

2 answers

I think you are adding an attribute named third to the pandas DataFrame object, not a column. If you want to add a column named "third", you should do the following:

 df['third'] = pd.DataFrame(np.random.random((4,1)))
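The difference can be checked directly. Here is a minimal sketch, assuming a small two-column frame like the one in the question; the variable names cols_after_attribute and cols_after_item are made up for illustration, and the warning filter is only there because recent pandas versions warn on this kind of attribute assignment.

```python
import warnings

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((4, 2)), columns=["a", "b"])

# Attribute assignment: this sets a plain Python attribute on the object.
# Recent pandas versions emit a UserWarning here, silenced for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.third = np.random.random(4)

cols_after_attribute = list(df.columns)

# Item assignment: this is what actually creates a column.
df["third"] = np.random.random(4)
cols_after_item = list(df.columns)

print(cols_after_attribute)  # ['a', 'b']
print(cols_after_item)       # ['a', 'b', 'third']
```

Only the bracket form changes df.columns; the attribute form leaves the table itself untouched.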

ivanicki.ilia Mar 23 '17 at 14:19

Edchum · Accepted Answer · 2017-03-23T14:18:50+0000

It is adding third as an attribute. You should stop accessing columns as attributes and always use df['third'] to avoid ambiguous behavior.

You should get into the habit of always accessing and assigning columns with df[col_name] to avoid problems like

 df.mean = some_calc()

The problem here is that mean is a method of DataFrame.

So you have now overwritten the method with some calculated value.
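A minimal sketch of that failure mode, assuming a toy frame and using the constant 42.0 as a stand-in for some_calc() (both are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

column_means = df.mean()  # works: mean is still the method here

df.mean = 42.0            # shadows the method on this one object

try:
    df.mean()
    call_failed = False
except TypeError:
    # df.mean is now a float, and a float is not callable
    call_failed = True

# The class itself is untouched, so the method is still reachable there.
via_class = pd.DataFrame.mean(df)

print(call_failed)      # True
print(list(via_class))  # [1.5, 3.5]
```

Nothing fails at the assignment itself; the error only appears later, at the first call to df.mean(), which is what makes this kind of mistake easy to miss.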

Sadly, this was part of the design as a convenience, and the pandas data analysis book and some early online video presentations showed it as a way to assign a new column. The errors it causes can be so subtle that it really should be banned and removed, IMO.

Seriously, I can't stress this enough: stop referring to columns as attributes. It is a serious source of bugs and, unfortunately, I still see a lot of answers showing this usage.

You can see that no new column has been added:

    In [97]:
    df.third = pd.DataFrame(np.random.random((4,1)))
    df.columns

    Out[97]: Index(['a', 'b'], dtype='object')

You can see that third was added as an attribute:

    In [98]: df.__dict__

    Out[98]:
    {'_data': BlockManager
     Items: Index(['a', 'b'], dtype='object')
     Axis 1: Int64Index([0, 1, 2, 3], dtype='int64')
     FloatBlock: slice(0, 2, 1), 2 x 4, dtype: float64,
     '_iloc': <pandas.core.indexing._iLocIndexer at 0x7e73b00>,
     '_item_cache': {},
     'is_copy': None,
     'third':           0
     0  0.844821
     1  0.286501
     2  0.459170
     3  0.243452}

You can see the internal entries such as _data (with its Items and Axis 1), _iloc, and so on, but there is also 'third', which is a plain attribute.
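The same check can be done without reading the whole dictionary. This is a rough sketch on a made-up two-column frame; the exact internal keys in df.__dict__ differ between pandas versions, so it only tests for the names of interest.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Silence the UserWarning that recent pandas versions raise for this pattern.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.third = [5, 6]

instance_attrs = vars(df)  # same mapping as df.__dict__

print("third" in instance_attrs)  # True: stored as a plain attribute
print("a" in instance_attrs)      # False: real columns are not stored here
print("third" in df.columns)      # False: no column was created
```

Real columns never show up as keys in the instance dictionary, while third does, which is why df.a and df.third look alike but are stored very differently.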
