Pandas DataFrame.unstack () Reorders row and column headers

I ran into the following problem of sorting row and column headers.

Here's how to reproduce it:

X =pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100))) A=pd.qcut(X['x'], [0,0.25,0.5,0.75,1.0]) #create a factor B=pd.qcut(X['y'], [0,0.25,0.5,0.75,1.0]) # create another factor g = X.groupby([A,B])['x'].mean() #do a two-way bucketing print g #this gives the following and so far so good xy [-2.315, -0.843] [-2.58, -0.567] -1.041167 (-0.567, 0.0321] -1.722926 (0.0321, 0.724] -1.245856 (0.724, 3.478] -1.240876 (-0.843, -0.228] [-2.58, -0.567] -0.576264 (-0.567, 0.0321] -0.501709 (0.0321, 0.724] -0.522697 (0.724, 3.478] -0.506259 (-0.228, 0.382] [-2.58, -0.567] 0.175768 (-0.567, 0.0321] 0.214353 (0.0321, 0.724] 0.113650 (0.724, 3.478] -0.013758 (0.382, 2.662] [-2.58, -0.567] 0.983807 (-0.567, 0.0321] 1.214640 (0.0321, 0.724] 0.808608 (0.724, 3.478] 1.515334 Name: x, dtype: float64 #Now let make a two way table and here is the problem: HTML(g.unstack().to_html()) 

It shows:

 y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478] [-2.58, -0.567] x (-0.228, 0.382] 0.214353 0.113650 -0.013758 0.175768 (-0.843, -0.228] -0.501709 -0.522697 -0.506259 -0.576264 (0.382, 2.662] 1.214640 0.808608 1.515334 0.983807 [-2.315, -0.843] -1.722926 -1.245856 -1.240876 -1.041167 

Note how headers are no longer sorted. I am wondering if this is a good way to solve this problem to make interactive work easy.

To further track the problem, follow these steps:

 g.unstack().columns 

This gives me the following: Index ([(- 0.567, 0.0321], (0.0321, 0.724), (0.724, 3.478), [-2.58, -0.567]], dtype = object)

Now compare this to B.levels:

 B.levels Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object) 

Obviously, the original source code is lost.

Now, to make matters worse, make a multi-level crosstab:

 g2 = X.groupby([A,B]).agg('mean') g3 = g2.stack().unstack(-2) HTML(g3.to_html()) 

It shows something like:

 y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478] x (-0.228, 0.382] x 0.214353 0.113650 -0.013758 y -0.293465 0.321836 1.180369 (-0.843, -0.228] x -0.501709 -0.522697 -0.506259 y -0.204811 0.324571 1.167005 (0.382, 2.662] x 1.214640 0.808608 1.515334 y -0.195446 0.161198 1.074532 [-2.315, -0.843] x -1.722926 -1.245856 -1.240876 y -0.392896 0.335471 1.730513 

Both row and column labels are not sorted correctly.

Thanks.

+8
python pandas
source share
1 answer

This seems a bit hacky, but here goes:

 In [11]: g_unstacked = g.unstack() In [12]: g_unstacked Out[12]: y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565] x (-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337 (-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180 (0.625, 1.639] 0.987372 1.006496 0.830710 1.202158 [-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679 

Using the fact that unique preserves the order * (capturing the unique first members from the g index):

 In [13]: g.index.get_level_values(0).unique() Out[13]: array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]', '(0.625, 1.639]'], dtype=object) 

As you can see, they are in the correct order.

Now you can reindex as follows:

 In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique()) Out[14]: y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565] [-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679 (-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180 (-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337 (0.625, 1.639] 0.987372 1.006496 0.830710 1.202158 

What is now in the correct order.

Update (I missed that the columns are also out of order).
You can use the same trick for columns (you will need to link these operations):

 In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1) 

* This is the reason why the unique series is significantly faster than np.unique .

+7
source share

All Articles