Pandas DataFrame.unstack() reorders row and column headers

Question

I ran into the following problem of sorting row and column headers.

Here's how to reproduce it:

    import numpy as np
    import pandas as pd
    from IPython.display import HTML

    X = pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
    A = pd.qcut(X['x'], [0, 0.25, 0.5, 0.75, 1.0])  # create a factor
    B = pd.qcut(X['y'], [0, 0.25, 0.5, 0.75, 1.0])  # create another factor
    g = X.groupby([A, B])['x'].mean()  # do a two-way bucketing
    print(g)

This gives the following, and so far so good:

    x                 y
    [-2.315, -0.843]  [-2.58, -0.567]    -1.041167
                      (-0.567, 0.0321]   -1.722926
                      (0.0321, 0.724]    -1.245856
                      (0.724, 3.478]     -1.240876
    (-0.843, -0.228]  [-2.58, -0.567]    -0.576264
                      (-0.567, 0.0321]   -0.501709
                      (0.0321, 0.724]    -0.522697
                      (0.724, 3.478]     -0.506259
    (-0.228, 0.382]   [-2.58, -0.567]     0.175768
                      (-0.567, 0.0321]    0.214353
                      (0.0321, 0.724]     0.113650
                      (0.724, 3.478]     -0.013758
    (0.382, 2.662]    [-2.58, -0.567]     0.983807
                      (-0.567, 0.0321]    1.214640
                      (0.0321, 0.724]     0.808608
                      (0.724, 3.478]      1.515334
    Name: x, dtype: float64

Now let's make a two-way table, and here is the problem:

    HTML(g.unstack().to_html())

It shows:

    y                 (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  [-2.58, -0.567]
    x
    (-0.228, 0.382]           0.214353         0.113650       -0.013758         0.175768
    (-0.843, -0.228]         -0.501709        -0.522697       -0.506259        -0.576264
    (0.382, 2.662]            1.214640         0.808608        1.515334         0.983807
    [-2.315, -0.843]         -1.722926        -1.245856       -1.240876        -1.041167

Note how the headers are no longer sorted. I am wondering whether there is a good way to solve this problem, to make interactive work easy.

To track the problem down further, look at the columns of the unstacked result:

    g.unstack().columns

This gives me the following:

    Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)

Now compare this to B.levels:

    B.levels

    Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)

Obviously, the original order of the factor levels is lost.
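The column order above is exactly what a plain lexicographic sort of the label strings would produce, since '(' sorts before '[' in ASCII. The following is a minimal sketch in pure Python, assuming the labels are ordinary strings (as the dtype=object output suggests). Whether unstack sorts this way internally is an assumption here; the sketch only shows that the observed order is consistent with it.

```python
# Labels copied from B.levels above, in their original (factor) order.
factor_order = [
    '[-2.58, -0.567]',
    '(-0.567, 0.0321]',
    '(0.0321, 0.724]',
    '(0.724, 3.478]',
]

# A plain string sort compares character by character, so every label
# starting with '(' comes before the one starting with '['.
lexicographic = sorted(factor_order)

for label in lexicographic:
    print(label)
```

The sorted list matches the order reported by g.unstack().columns, with the '[' interval moved to the end.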

Now, to make matters worse, make a multi-level crosstab:

    g2 = X.groupby([A, B]).agg('mean')
    g3 = g2.stack().unstack(-2)
    HTML(g3.to_html())

It shows something like:

    y                    (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]
    x
    (-0.228, 0.382]   x          0.214353         0.113650       -0.013758
                      y         -0.293465         0.321836        1.180369
    (-0.843, -0.228]  x         -0.501709        -0.522697       -0.506259
                      y         -0.204811         0.324571        1.167005
    (0.382, 2.662]    x          1.214640         0.808608        1.515334
                      y         -0.195446         0.161198        1.074532
    [-2.315, -0.843]  x         -1.722926        -1.245856       -1.240876
                      y         -0.392896         0.335471        1.730513

Neither the row labels nor the column labels are sorted correctly.

Thanks.

python pandas

Tom Bennett Jun 17 '13 at 20:54

1 answer

Andy Hayden · Accepted Answer · 2013-06-17T21:23:08+0000

This seems a bit hacky, but here goes:

    In [11]: g_unstacked = g.unstack()

    In [12]: g_unstacked
    Out[12]:
    y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
    x
    (-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
    (-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
    (0.625, 1.639]          0.987372       1.006496       0.830710          1.202158
    [-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679

You can use the fact that unique preserves order* (here grabbing the unique first-level values from the index of g):

    In [13]: g.index.get_level_values(0).unique()
    Out[13]:
    array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]',
           '(0.625, 1.639]'], dtype=object)

As you can see, they are in the correct order.

Now you can reindex as follows:

    In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique())
    Out[14]:
    y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
    [-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679
    (-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
    (-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
    (0.625, 1.639]          0.987372       1.006496       0.830710          1.202158

The rows are now in the correct order.

Update: I missed that the columns are also out of order.
You can use the same trick for the columns (you will need to chain the two operations):

    In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1)
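Chaining the row and column steps might look like the sketch below. It is an illustration under stated assumptions, not the exact session above: the data is made up (the labels are borrowed from the output shown earlier, the values are placeholders), and it uses reindex with the index= and columns= keywords in place of reindex_axis, which newer pandas releases no longer provide.

```python
import numpy as np
import pandas as pd

# Hypothetical string labels in their intended (factor) order.
row_labels = ['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]', '(0.625, 1.639]']
col_labels = ['[-2.177, -0.565]', '(-0.565, 0.12]', '(0.12, 0.791]', '(0.791, 2.57]']

# Placeholder values 0..15 stand in for the grouped means.
index = pd.MultiIndex.from_product([row_labels, col_labels], names=['x', 'y'])
g = pd.Series(np.arange(16, dtype=float), index=index, name='x')

# unique() keeps first-appearance order, which here is the intended order.
row_order = g.index.get_level_values('x').unique()
col_order = g.index.get_level_values('y').unique()

# Reorder both axes of the unstacked table in one call.
table = g.unstack().reindex(index=row_order, columns=col_order)
print(table)
```

This only works when the grouped result itself lists the labels in the intended order, as g does in the question.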

* This is also the reason why Series.unique is significantly faster than np.unique: it does not sort its result.
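A small sketch of the difference the footnote refers to, using a made-up Series (the values are hypothetical): Series.unique returns values in order of first appearance, while np.unique returns them sorted.

```python
import numpy as np
import pandas as pd

# Hypothetical data with repeats, deliberately not in sorted order.
s = pd.Series(['b', 'a', 'b', 'c', 'a'])

# Order of first appearance is kept.
appearance_order = list(s.unique())

# np.unique sorts the distinct values.
sorted_order = list(np.unique(s.to_numpy()))

print(appearance_order)
print(sorted_order)
```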
