For each row, what is the fastest way to find a column containing the nth element that is not NaN?

I have a Python pandas DataFrame in which every element is either a float or NaN. For each row, I need to find the column that holds the row's nth non-NaN element. I know that such a column always exists for every row.

So, if n is 4 and the DataFrame, called myDF, was as follows:

        10   20   30   40   50   60   70   80   90   100
    A   4.5  5.5  2.5  NaN  NaN  2.9  NaN  NaN  1.1  1.8
    B   4.7  4.1  NaN  NaN  NaN  2.0  1.2  NaN  NaN  NaN
    C   NaN  NaN  NaN  NaN  NaN  1.9  9.2  NaN  4.4  2.1
    D   1.1  2.2  3.5  3.4  4.5  NaN  NaN  NaN  1.9  5.5

I would like to get:

    A     60
    B     70
    C    100
    D     40
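For reference, the example above can be reconstructed as a runnable DataFrame. This is a sketch; the integer column labels and string index labels are assumptions read off the table:

```python
import numpy as np
import pandas as pd

# Reconstruction of the example table; np.nan marks the missing entries.
myDF = pd.DataFrame(
    [
        [4.5, 5.5, 2.5, np.nan, np.nan, 2.9, np.nan, np.nan, 1.1, 1.8],
        [4.7, 4.1, np.nan, np.nan, np.nan, 2.0, 1.2, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, np.nan, np.nan, 1.9, 9.2, np.nan, 4.4, 2.1],
        [1.1, 2.2, 3.5, 3.4, 4.5, np.nan, np.nan, np.nan, 1.9, 5.5],
    ],
    index=list("ABCD"),
    columns=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
)
```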

I could do:

    import math
    import pandas as pd

    n = 4  # some arbitrary int
    for row in myDF.index:
        num_not_NaN = 0
        for c in myDF.columns:
            if not math.isnan(myDF[c][row]):
                num_not_NaN += 1
            if num_not_NaN == n:
                print(row, c)
                break

I am sure this is very slow and not very pythonic. Is there an approach that will be faster for a very large DataFrame and large values of n?

3 answers

If your goal is speed, stick to pandas' vectorized methods wherever you can:

    >>> (df.notnull().cumsum(axis=1) == 4).idxmax(axis=1)  # replace 4 with any number you like
    A     60
    B     70
    C    100
    D     40
    dtype: object

The other answers are good and perhaps a little clearer syntactically. In terms of speed, there is not much difference between the approaches for your small example. However, on a slightly larger DataFrame the vectorized method is already about 60 times faster:

    >>> df2 = pd.concat([df] * 1000)  # 4000-row DataFrame
    >>> %timeit df2.apply(lambda row: get_nth(row, n), axis=1)
    1 loops, best of 3: 749 ms per loop
    >>> %timeit df2.T.apply(lambda x: x.dropna()[n-1:].index[0])
    1 loops, best of 3: 673 ms per loop
    >>> %timeit (df2.notnull().cumsum(1) == 4).idxmax(axis=1)
    100 loops, best of 3: 10.5 ms per loop
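One caveat worth noting, not raised in the answer itself: if a row has fewer than n non-NaN values, the boolean mask is all False for that row, and idxmax silently returns the first column label. A sketch of a guard (the masking-with-`where` step is my addition, not part of the original answer):

```python
import numpy as np
import pandas as pd

n = 4
# Hypothetical example: row 'short' has only two non-NaN values, fewer than n.
df = pd.DataFrame(
    {10: [1.0, np.nan], 20: [2.0, np.nan], 30: [3.0, 1.0], 40: [4.0, 2.0]},
    index=["ok", "short"],
)
mask = df.notnull().cumsum(axis=1) == n
# mask.idxmax(axis=1) alone would return the first column label (10) for
# 'short'; masking with any() yields NaN there instead of a wrong answer.
result = mask.idxmax(axis=1).where(mask.any(axis=1))
```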

You can write a helper function and call it from a lambda passed to apply.

The function filters out the NaN values from the series and returns the index label of the nth remaining element (or None if the series holds fewer than n non-NaN values).

The lambda needs axis=1 so that it is applied to each row of the DataFrame.

    def get_nth(series, n):
        # keep only the non-NaN elements
        s = series[series.notnull()]
        if len(s) >= n:
            return s.index[n - 1]

    >>> n = 4
    >>> df.apply(lambda row: get_nth(row, n), axis=1)
    A     60
    B     70
    C    100
    D     40
    dtype: object

You can transpose df and apply a lambda that drops the NaN values, slices from the nth value onward, and returns the first remaining index label:

    In [72]: n = 4; df.T.apply(lambda x: x.dropna()[n-1:].index[0])
    Out[72]:
    A     60
    B     70
    C    100
    D     40
    dtype: object
