Select rows where at least one value from the column list is not null

I have a large dataframe with many columns (e.g. 1000), and a list of columns generated by a script (~10 of them). I would like to select all the rows of the original dataframe where at least one of those columns is not null.

So, if I knew the number of my columns in advance, I could do something like this:

 list_of_cols = ['col1', ...]
 df[df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ... |
    df[list_of_cols[6]].notnull()]

I could also iterate over the column list and build a mask to apply to df , but that feels too tedious. Given how powerful pandas is at working with NaN, I would expect there is a simpler way to get what I want.
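For reference, the iterative version I'd like to avoid looks something like this (a sketch with made-up data; the real `list_of_cols` is assumed to hold valid column names):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'col1': [np.nan, 1.0, np.nan],
                   'col2': [np.nan, np.nan, 2.0],
                   'col3': [np.nan, np.nan, np.nan]})
list_of_cols = ['col1', 'col2']

# Build the OR-mask column by column
mask = pd.Series(False, index=df.index)
for col in list_of_cols:
    mask = mask | df[col].notnull()

result = df[mask]  # rows with at least one non-null value in list_of_cols
```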

3 answers

Use the thresh parameter of the dropna() method. Setting thresh=1 means: keep a row as long as it has at least one non-null value.

 import numpy as np
 import pandas as pd

 df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
 list_of_cols = list(range(10))
 df[list_of_cols].dropna(thresh=1).head()
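Note that `df[list_of_cols].dropna(thresh=1)` returns only those ten columns. If you want every column of the original frame back, you can reselect by the surviving index, or (an equivalent sketch, not part of the answer above) use dropna's subset/how parameters directly:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # hypothetical seed, for reproducibility only
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 50), p=(.3, .7)))
list_of_cols = list(range(10))

# Reselect the full-width rows by the surviving index...
kept = df.loc[df[list_of_cols].dropna(thresh=1).index]

# ...or let dropna do it in one call: drop a row only when ALL
# of the listed columns are null
kept2 = df.dropna(subset=list_of_cols, how='all')
```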



Starting from this:

 import numpy as np
 import pandas as pd

 data = {'a': [np.nan, 0, 0, 0, 0, 0, np.nan, 0, 0, 0, 0, 0, 9, 9],
         'b': [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 7],
         'c': [np.nan, np.nan, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1],
         'd': [np.nan, np.nan, 7, 9, 6, 9, 7, np.nan, 6, 6, 7, 6, 9, 6]}
 df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
 df
       a    b    c    d
 0   NaN  NaN  NaN  NaN
 1   0.0  NaN  NaN  NaN
 2   0.0  1.0  1.0  7.0
 3   0.0  1.0  1.0  9.0
 4   0.0  1.0  2.0  6.0
 5   0.0  1.0  2.0  9.0
 6   NaN  1.0  3.0  7.0
 7   0.0  1.0  3.0  NaN
 8   0.0  1.0  3.0  6.0
 9   0.0  2.0  1.0  6.0
 10  0.0  2.0  1.0  7.0
 11  0.0  2.0  1.0  6.0
 12  9.0  1.0  1.0  9.0
 13  9.0  7.0  1.0  6.0

Select the rows where not all values are null (this removes row index 0):

 df[~df.isnull().all(axis=1)]
       a    b    c    d
 1   0.0  NaN  NaN  NaN
 2   0.0  1.0  1.0  7.0
 3   0.0  1.0  1.0  9.0
 4   0.0  1.0  2.0  6.0
 5   0.0  1.0  2.0  9.0
 6   NaN  1.0  3.0  7.0
 7   0.0  1.0  3.0  NaN
 8   0.0  1.0  3.0  6.0
 9   0.0  2.0  1.0  6.0
 10  0.0  2.0  1.0  7.0
 11  0.0  2.0  1.0  6.0
 12  9.0  1.0  1.0  9.0
 13  9.0  7.0  1.0  6.0
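Since the question asks about a column list rather than the whole frame, the same idea can be restricted to a subset (a minimal sketch; the data and `list_of_cols` here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 0.0, 9.0],
                   'b': [np.nan, np.nan, 7.0],
                   'd': [np.nan, np.nan, 6.0]})
list_of_cols = ['b', 'd']

# Keep rows where at least one of the listed columns is non-null;
# row 1 is dropped even though column 'a' has a value there,
# because only 'b' and 'd' are checked
result = df[~df[list_of_cols].isnull().all(axis=1)]
```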

You can use boolean indexing:

 df[~pd.isnull(df[list_of_cols]).all(axis=1)] 

Explanation:

The expression ~pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array that is used as a filter on the dataframe:

  • isnull() applied to df[list_of_cols] creates a boolean mask over df[list_of_cols] , with True for null elements and False otherwise

  • all() returns True if all elements in a row are True (row-wise with axis=1 )

Thus the negation ~ (not all null = at least one not null) gives a mask for all rows that contain at least one non-null value in the given column list.
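By De Morgan's law, "not (all null)" is the same as "any non-null", so the mask can also be built without the negation (a sketch with made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 22, np.nan],
                   'B': ['x', np.nan, np.nan]})
list_of_cols = ['A', 'B']

# Negated form: NOT (every listed column is null in this row)
mask_neg = ~pd.isnull(df[list_of_cols]).all(axis=1)

# Equivalent positive form: at least one listed column is non-null
mask_any = df[list_of_cols].notnull().any(axis=1)
# The two masks are identical
```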

Example:

Dataframe:

 >>> df = pd.DataFrame({'A': [11, 22, 33, np.nan],
 ...                    'B': ['x', np.nan, np.nan, 'w'],
 ...                    'C': ['2016-03-13', np.nan, '2016-03-14', '2016-03-15']})
 >>> list_of_cols = ['B', 'C']
 >>> df
      A    B           C
 0   11    x  2016-03-13
 1   22  NaN         NaN
 2   33  NaN  2016-03-14
 3  NaN    w  2016-03-15
The mask ~isnull :

 >>> ~pd.isnull(df[list_of_cols])
        B      C
 0   True   True
 1  False  False
 2  False   True
 3   True   True

Apply all(axis=1) row by row, then negate:

 >>> ~pd.isnull(df[list_of_cols]).all(axis=1)
 0     True
 1    False
 2     True
 3     True
 dtype: bool

Boolean selection from the dataframe:

 >>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
      A    B           C
 0   11    x  2016-03-13
 2   33  NaN  2016-03-14
 3  NaN    w  2016-03-15
