Get the largest values from each pandas.DataFrame column

Here is my pandas.DataFrame:

    import pandas as pd

    data = pd.DataFrame({
        'first': [40, 32, 56, 12, 89],
        'second': [13, 45, 76, 19, 45],
        'third': [98, 56, 87, 12, 67]
    }, index=['first', 'second', 'third', 'fourth', 'fifth'])

I want to create a new DataFrame that contains the top 3 values from each column of my data DataFrame.

Here is the expected result:

       first  second  third
    0     89      76     98
    1     56      45     87
    2     40      45     67

How can I do this?

5 answers

Create a function to return the top three values of a series:

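    # note: this name shadows Python's built-in sorted()
    def sorted(s, num):
        # sort descending and keep the first `num` values
        tmp = s.sort_values(ascending=False)[:num]  # earlier s.order(..)
        # relabel as 0..num-1 so the columns line up in the result
        tmp.index = range(num)
        return tmp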

Apply it to your dataset:

    In [1]: data.apply(lambda x: sorted(x, 3))
    Out[1]:
       first  second  third
    0     89      76     98
    1     56      45     87
    2     40      45     67
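The same sort-and-truncate idea can also be written inline, without a named helper. The following is a minimal sketch that assumes the `data` frame from the question; `top_sorted` is a name introduced here purely for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {
        'first': [40, 32, 56, 12, 89],
        'second': [13, 45, 76, 19, 45],
        'third': [98, 56, 87, 12, 67],
    },
    index=['first', 'second', 'third', 'fourth', 'fifth'],
)

# Sort each column in descending order, keep the first three values,
# and drop the original labels so every column is indexed 0, 1, 2.
top_sorted = data.apply(
    lambda s: s.sort_values(ascending=False).head(3).reset_index(drop=True)
)
print(top_sorted)
```

Dropping the labels with `reset_index(drop=True)` matters here: without it, each column would keep its own row labels and the combined result would be padded with NaN where the labels differ.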

With numpy, you can get an array of the top-3 values along the columns, for example:

    >>> import numpy as np
    >>> col_ind = np.argsort(data.values, axis=0)[::-1,:]
    >>> ind_to_take = col_ind[:3,:] + np.arange(data.shape[1])*data.shape[0]
    >>> np.take(data.values.T, ind_to_take)
    array([[89, 76, 98],
           [56, 45, 87],
           [40, 45, 67]], dtype=int64)

You can convert back to a DataFrame:

    >>> pd.DataFrame(_, columns = data.columns, index=data.index[:3])
            first  second  third
    first      89      76     98
    second     56      45     87
    third      40      45     67
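The flat-index arithmetic above can be avoided with `np.take_along_axis`, which picks values along an axis using an index array of matching shape. This is only a sketch of an alternative, not the answer's own method; it assumes the `data` frame from the question and a NumPy version that provides `take_along_axis` (1.15 or later). The names `order`, `top3` and `top3_frame` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        'first': [40, 32, 56, 12, 89],
        'second': [13, 45, 76, 19, 45],
        'third': [98, 56, 87, 12, 67],
    },
    index=['first', 'second', 'third', 'fourth', 'fifth'],
)

values = data.to_numpy()

# Row positions that sort each column ascending, reversed to descending.
order = np.argsort(values, axis=0)[::-1]

# For every column, pick the rows named by the first three positions.
top3 = np.take_along_axis(values, order[:3], axis=0)

# Wrap the result in a DataFrame with a fresh 0..2 index.
top3_frame = pd.DataFrame(top3, columns=data.columns)
print(top3_frame)
```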

The other solutions (at the time of writing) sort the DataFrame, which takes super-linear time per column, but this can be done in linear time per column.

First, numpy.partition moves the k smallest elements into the first k positions (in no particular order). To get the k largest elements, we can negate the values, partition, and negate again:

    import numpy as np
    -np.partition(-v, k)[:k]

Combining this with a dictionary comprehension, we can use:

    >>> pd.DataFrame({c: -np.partition(-data[c], 3)[:3] for c in data.columns})
       first  second  third
    0     89      76     98
    1     56      45     87
    2     40      45     67
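Because `numpy.partition` leaves the selected elements in no particular order, the k values it returns are not guaranteed to come out largest-first. A small sketch of one way to handle that, assuming signed numeric columns and the `data` frame from the question: sort only the k selected values afterwards, which costs O(k log k) on top of the linear-time selection. `top_k_desc` is a hypothetical helper, and it partitions at `k - 1` (rather than `k`) so that it also works when k equals the column length.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        'first': [40, 32, 56, 12, 89],
        'second': [13, 45, 76, 19, 45],
        'third': [98, 56, 87, 12, 67],
    },
    index=['first', 'second', 'third', 'fourth', 'fifth'],
)


def top_k_desc(values, k):
    """Hypothetical helper: the k largest values, in descending order."""
    arr = np.asarray(values)
    # Partitioning the negated values puts the k largest originals
    # into the first k slots, in no particular order.
    top = -np.partition(-arr, k - 1)[:k]
    # Sort just those k values ascending, then reverse to descending.
    return np.sort(top)[::-1]


result = pd.DataFrame({c: top_k_desc(data[c], 3) for c in data.columns})
print(result)
```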

Alternative pandas solution:

    In [6]: N = 3

    In [7]: pd.DataFrame([data[c].nlargest(N).values.tolist() for c in data.columns],
       ...:              index=data.columns,
       ...:              columns=['{}_largest'.format(i) for i in range(1, N+1)]).T
       ...:
    Out[7]:
               first  second  third
    1_largest     89      76     98
    2_largest     56      45     87
    3_largest     40      45     67

Use nlargest, for example:

    In [1594]: pd.DataFrame({c: data[c].nlargest(3).values for c in data})
    Out[1594]:
       first  second  third
    0     89      76     98
    1     56      45     87
    2     40      45     67

where data is:

    In [1603]: data
    Out[1603]:
            first  second  third
    first      40      13     98
    second     32      45     56
    third      56      76     87
    fourth     12      19     12
    fifth      89      45     67
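For comparison, `nlargest` can also be combined with `DataFrame.apply` instead of a dictionary comprehension. This is a sketch under the same assumption (the `data` frame from the question); `top3` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame(
    {
        'first': [40, 32, 56, 12, 89],
        'second': [13, 45, 76, 19, 45],
        'third': [98, 56, 87, 12, 67],
    },
    index=['first', 'second', 'third', 'fourth', 'fifth'],
)

# nlargest returns the three largest values of a column, largest first,
# still carrying their original row labels; resetting the index makes
# the columns align on 0, 1, 2 when apply assembles the result.
top3 = data.apply(lambda s: s.nlargest(3).reset_index(drop=True))
print(top3)
```

When a column has ties at the cut-off, the `keep` parameter of `nlargest` controls which duplicates are retained; the default keeps the first occurrences.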