Why does Pandas iterate over DataFrame columns by default?

I'm trying to understand the design rationale behind some of Pandas' features.

If I have a DataFrame with 3560 rows and 18 columns, then

len(frame) 

is 3560 but

len([a for a in frame])

is 18.
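
For example, a minimal sketch with made-up random data standing in for the 3560 × 18 frame reproduces this:

    import numpy as np
    import pandas as pd

    # Hypothetical frame standing in for the 3560-row, 18-column DataFrame above.
    frame = pd.DataFrame(np.random.randn(3560, 18),
                         columns=["col_%d" % i for i in range(18)])

    print(len(frame))               # 3560 -- len() counts rows
    print(len([a for a in frame]))  # 18   -- iterating yields the column labels
    print(list(frame)[:3])          # ['col_0', 'col_1', 'col_2']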

Maybe this seems natural to someone coming from R, but to me it isn't very "Pythonic". Is the basic design rationale for Pandas written up anywhere?

+8
pandas
2 answers

A DataFrame is, first and foremost, a column-oriented data structure. Under the hood, the data inside a DataFrame is stored in blocks; roughly speaking, there is one block per dtype, and each column has a single dtype. Accessing a column therefore amounts to selecting that column from a single block. Selecting a single row, by contrast, requires picking the corresponding row out of every block and then copying those pieces into a new Series. So iterating over the rows of a DataFrame is, under the hood, not as natural an operation as iterating over its columns.
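
To make that concrete, here is a small sketch (the column names and values are made up): with mixed dtypes, pulling out a column keeps its dtype, while pulling out a row has to gather a value from each block and copy them into a new Series, upcasting to object.

    import pandas as pd

    # One int column, one float column, one string column -- three blocks.
    df = pd.DataFrame({"a": [1, 2, 3],
                       "b": [1.5, 2.5, 3.5],
                       "c": ["x", "y", "z"]})

    # Column access stays within a single block and keeps the dtype.
    print(df["b"].dtype)    # float64

    # Row access assembles one value from each block into a new Series,
    # which with mixed dtypes is upcast to object (and is a copy).
    print(df.loc[0].dtype)  # object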

If you need to iterate over rows, you still can, by calling df.iterrows(). But you should avoid df.iterrows() where possible, for the same reason it is unnatural: it requires copying, which makes it slower than iterating over columns.
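
A small sketch (column names are made up) showing both styles; the column-wise, vectorised version is usually what you want:

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

    # Row-wise iteration: each row is copied into a fresh Series.
    for idx, row in df.iterrows():
        print(idx, row["x"], row["y"])

    # Column-wise (vectorised) work avoids the per-row copies entirely.
    df["z"] = df["x"] * df["y"]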

+15

There is a decent explanation in the docs: iteration over a Pandas DataFrame is meant to be "dict-like", so it iterates over the keys, i.e. the columns.

It may be a bit confusing that iterating over a Series instead yields its values, but as the docs note, that is because a Series is more "array-like".
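
A short sketch (with made-up data) of both behaviours:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    s = pd.Series([10, 20, 30])

    # DataFrame iteration is dict-like: it yields the keys (column labels).
    print(list(df))        # ['a', 'b']
    print("a" in df)       # True -- membership also checks column labels
    for name, col in df.items():   # like dict.items(): (label, column Series)
        print(name, col.tolist())

    # Series iteration is array-like: it yields the values.
    print(list(s))         # [10, 20, 30]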

+4
