How to understand axis = 0 or 1 in pandas (Python)?

Question

From the documentation: "the first runs vertically down the rows (axis 0), and the second runs horizontally along the columns (axis 1)". And the code:

    import pandas as pd

    df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
                       index=['a', 'b', 'c', 'd', 'e'])
    df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9], "z": [9, 8, 7, 6, 5]},
                       index=['b', 'c', 'd', 'e', 'f'])
    pd.concat([df1, df2], join='inner')  # by default axis=0

Since axis=0 (which I interpret as the columns), I think concat only considers columns found in both data frames. But the actual output considers rows that are found in both data frames (the only common row element is 'y'). So how should I understand axis=0 and axis=1 correctly?

python pandas axis

lxdthriller Sep 2 '16 at 2:11

3 answers

Maxu · Answer 1 · 2016-09-02T03:47:10+0000

Data:

    In [55]: df1
    Out[55]:
       x  y
    a  1  3
    b  2  4
    c  3  5
    d  4  6
    e  5  7

    In [56]: df2
    Out[56]:
       y  z
    b  1  9
    c  3  8
    d  5  7
    e  7  6
    f  9  5

Horizontal concatenation (axis=1) using the index labels found in both DFs (the frames are aligned on their indexes for the join):

    In [57]: pd.concat([df1, df2], join='inner', axis=1)
    Out[57]:
       x  y  y  z
    b  2  4  1  9
    c  3  5  3  8
    d  4  6  5  7
    e  5  7  7  6

Vertical concatenation (the default, axis=0) using the columns found in both DFs:

    In [58]: pd.concat([df1, df2], join='inner')
    Out[58]:
       y
    a  3
    b  4
    c  5
    d  6
    e  7
    b  1
    c  3
    d  5
    e  7
    f  9

If you don't use the inner join, you get this instead:

    In [62]: pd.concat([df1, df2])
    Out[62]:
         x  y    z
    a  1.0  3  NaN
    b  2.0  4  NaN
    c  3.0  5  NaN
    d  4.0  6  NaN
    e  5.0  7  NaN
    b  NaN  1  9.0
    c  NaN  3  8.0
    d  NaN  5  7.0
    e  NaN  7  6.0
    f  NaN  9  5.0

    In [63]: pd.concat([df1, df2], axis=1)
    Out[63]:
         x    y    y    z
    a  1.0  3.0  NaN  NaN
    b  2.0  4.0  1.0  9.0
    c  3.0  5.0  3.0  8.0
    d  4.0  6.0  5.0  7.0
    e  5.0  7.0  7.0  6.0
    f  NaN  NaN  9.0  5.0
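The two inner-join calls can also be checked by looking only at the labels of the result. This is a minimal sketch that rebuilds df1 and df2 from the question; the names `stacked` and `side` are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
                   index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9], "z": [9, 8, 7, 6, 5]},
                   index=["b", "c", "d", "e", "f"])

# axis=0 (default): frames are stacked on top of each other,
# and join='inner' keeps only the column labels shared by both.
stacked = pd.concat([df1, df2], join="inner")
print(list(stacked.columns))  # ['y']
print(len(stacked))           # 10

# axis=1: frames are placed side by side,
# and join='inner' keeps only the index labels shared by both.
side = pd.concat([df1, df2], join="inner", axis=1)
print(list(side.index))       # ['b', 'c', 'd', 'e']
print(list(side.columns))     # ['x', 'y', 'y', 'z']
```

In other words, `axis` says which direction the frames are glued along, and `join` is applied to the labels of the other axis.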

Boud · Answer 2 · 2016-09-02T02:57:03+0000

Interpret axis=0 as applying the algorithm down each column, or to the row labels (the index). A more detailed diagram is here.

If you apply this general interpretation to your case, the algorithm here is concat. Thus, for axis=0, this means:

for each column, take all the rows down (across all the data frames passed to concat) and concatenate them when the column is common to all of them (since you chose join='inner').

So the meaning is: take every column x and concatenate it down the rows, which stacks each chunk of rows one after another. However, x is not present everywhere, so it is not kept in the final result. The same goes for z. For y the result is kept, because y is in all the data frames. This is the result you got.
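The column-by-column reading above can be sketched by hand: find the columns shared by both frames, then stack only those. This is an illustration of the idea under that assumption, not a description of how pandas implements concat internally; `common`, `manual` and `builtin` are hypothetical names.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
                   index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9], "z": [9, 8, 7, 6, 5]},
                   index=["b", "c", "d", "e", "f"])

# Columns present in every frame: x and z drop out, y survives.
common = df1.columns.intersection(df2.columns)
print(list(common))

# Stack the surviving columns down the rows (axis=0).
manual = pd.concat([df1[common], df2[common]])

# The same result via join='inner'.
builtin = pd.concat([df1, df2], join="inner")
print(manual.equals(builtin))
```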

Tai · Answer 3 · 2018-01-12T01:46:17+0000

First, the OP misunderstood the rows and columns in their data frame.

But the actual output considers rows that are found in both data frames (the only common row element is 'y')

The OP took the label y for a row. However, y is a column name.

    df1 = pd.DataFrame(
        {"x": [1, 2, 3, 4, 5],   # <-- looks like row x but actually col x
         "y": [3, 4, 5, 6, 7]},  # <-- looks like row y but actually col y
        index=['a', 'b', 'c', 'd', 'e'])
    print(df1)

              \col    x    y
    index or row\
              a       1    3    |    a
              b       2    4    v    x
              c       3    5    r    i
              d       4    6    o    s
              e       5    7    w    0

                  -> column   axis 1

This is easy to misread, because in the dictionary it looks as if y and x are two rows.

If you create df1 from a list of lists, it should be more intuitive:

    df1 = pd.DataFrame([[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]],
                       index=['a', 'b', 'c', 'd', 'e'],
                       columns=["x", "y"])
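A quick way to convince yourself that the two constructions describe the same table is to build both and compare them. This is a small sketch; `from_dict` and `from_rows` are illustrative names.

```python
import pandas as pd

index = ["a", "b", "c", "d", "e"]

# Dictionary form: each key is a COLUMN, each list runs down the rows.
from_dict = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
                         index=index)

# List-of-lists form: each inner list is one ROW.
from_rows = pd.DataFrame([[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]],
                         index=index, columns=["x", "y"])

print(from_dict.shape)              # (5, 2): 5 rows, 2 columns
print(from_dict.equals(from_rows))  # True
```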

So, back to the problem: concat is short for concatenate (meaning to link together in a series or chain [source]). Performing concat along axis 0 means linking two objects along axis 0.

       1
       1   <-- series 1
       1
    ^  ^  ^
    |  |  |              1
    c  a  a              1
    o  l  x              1
    n  o  i   gives you  2
    c  n  s              2
    a  g  0              2
    t  |  |
    |  V  V
    v
       2
       2   <-- series 2
       2

So... I think you have a feel for it now. What about the sum function in pandas? What does sum(axis=0) mean?

Suppose the data looks like

    1 2
    1 2
    1 2

Maybe... summation along axis 0, you might guess. Yes!

    ^  ^  ^
    |  |  |
    s  a  a
    u  l  x
    m  o  i   gives you two values 3 6 !
    |  n  s
    v  g  0
       |  |
       V  V
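The same example can be run directly. This is a minimal sketch using the 3x2 data above with default integer labels; the axis=1 call is added only for contrast.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 2], [1, 2]])

# Sum ALONG axis 0: move down the rows, one total per column.
print(df.sum(axis=0).tolist())  # [3, 6]

# Sum ALONG axis 1: move across the columns, one total per row.
print(df.sum(axis=1).tolist())  # [3, 3, 3]
```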

How about dropna? Suppose you have the data

      1  2  NaN
    NaN  3    5
      2  4    6

and you only want to keep

    2
    3
    4

The documentation says: "Return object with labels on given axis omitted where alternately any or all of the data are missing."

Should you use dropna(axis=0) or dropna(axis=1)? Think about it and try with

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[1, 2, np.nan], [np.nan, 3, 5], [2, 4, 6]])
    # df.dropna(axis=0) or df.dropna(axis=1) ?

Hint: think about the word "along".
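For readers who want to check their answer, here is a short sketch that runs both variants on the data above. It relies on the default how='any', so a label is dropped as soon as it contains a single NaN; `by_rows` and `by_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan], [np.nan, 3, 5], [2, 4, 6]])

# axis=0 drops ROW labels that contain a NaN: only row 2 survives.
by_rows = df.dropna(axis=0)
print(list(by_rows.index))

# axis=1 drops COLUMN labels that contain a NaN: only column 1 survives,
# which is the column holding 2, 3, 4.
by_cols = df.dropna(axis=1)
print(list(by_cols.columns))
print(by_cols[1].tolist())
```

Note that dropna differs from sum here: the axis argument names the axis whose labels are removed, which is why axis=1 yields the single column the example asks for.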
