Removing columns from data frame where ALL values are NA

I'm having trouble with a data frame and can't resolve the issue myself:
The data frame has arbitrary properties as columns, and each row represents one data set.

Question:
How do I get rid of columns where the value is NA for ALL rows?

+108
r dataframe apply
Apr 15 '10 at 8:59
8 answers

Try the following:

 df <- df[, colSums(is.na(df)) < nrow(df)]
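
A small worked example may make the one-liner easier to follow. The toy data frame and its column names are made up for illustration, and `drop = FALSE` is an addition that is not part of the line above; it is included on the assumption that you want a data frame back even when only one column survives.

```r
# Toy data (hypothetical): column b is entirely NA, column c only partly
df <- data.frame(a = 1:3, b = c(NA, NA, NA), c = c("x", NA, "z"))

# Count the NAs per column; a column is kept if that count is below nrow(df)
keep <- colSums(is.na(df)) < nrow(df)

# drop = FALSE keeps the result a data frame even if a single column remains
df <- df[, keep, drop = FALSE]
names(df)
```

Without `drop = FALSE`, a result with a single remaining column would be simplified to a plain vector, which is easy to overlook.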
+125
Apr 15 '10 at 9:12

The two approaches offered so far fail with large data sets because (among other memory issues) they create is.na(df), which is an object with the same dimensions as df.

Here are two approaches that are more memory- and time-efficient.

Filter Approach

 Filter(function(x) !all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

 library(data.table)
 DT <- as.data.table(df)
 DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]

Examples using large data (30 columns, 1e6 rows)

 big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8, NA), 1e6, TRUE), sample(250, 1e6, TRUE)), simplify = FALSE)
 bd <- do.call(data.frame, big_data)
 names(bd) <- paste0('X', seq_len(30))
 DT <- as.data.table(bd)

 system.time({df1 <- bd[, colSums(is.na(bd)) < nrow(bd)]})
 # error -- can't allocate vector of size ...
 system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
 # error -- can't allocate vector of size ...
 system.time({df3 <- Filter(function(x) !all(is.na(x)), bd)})
 ##    user  system elapsed
 ##    0.26    0.03    0.29
 system.time({DT1 <- DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]})
 ##    user  system elapsed
 ##    0.14    0.03    0.18
+79
Sep 27

dplyr now has the verb select_if which can be useful here:

 library(dplyr)
 temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))
 not_all_na <- function(x) any(!is.na(x))
 not_any_na <- function(x) all(!is.na(x))

 > temp
   x  y  z
 1 1  1 NA
 2 2  2 NA
 3 3 NA NA
 4 4  4 NA
 5 5  5 NA

 > temp %>% select_if(not_all_na)
   x  y
 1 1  1
 2 2  2
 3 3 NA
 4 4  4
 5 5  5

 > temp %>% select_if(not_any_na)
   x
 1 1
 2 2
 3 3
 4 4
 5 5
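
In more recent dplyr releases (1.0.0 and later, as far as I know) `select_if()` is marked as superseded in favour of `select()` combined with `where()`. A sketch of the equivalent call, reusing the `temp` data and the `not_all_na` helper from this answer:

```r
library(dplyr)

temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))

# where() applies the predicate to each column and keeps those returning TRUE
result <- temp %>% select(where(not_all_na))
names(result)
```

`select_if()` should still work in those versions; this is only the spelling the newer documentation appears to prefer.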
+24
May 14 '18 at 16:40

Another way is to use the apply() function.

If you have the data.frame

 df <- data.frame(var1 = c(1:7, NA), var2 = c(1, 2, 1, 3, 4, NA, NA, 9), var3 = c(NA))

then you can use apply() to see which columns fulfil your condition, and then simply do the same subsetting as in Musa's answer, only with the apply approach.

 > !apply(is.na(df), 2, all)
  var1  var2  var3
  TRUE  TRUE FALSE

 > df[, !apply(is.na(df), 2, all)]
   var1 var2
 1    1    1
 2    2    2
 3    3    1
 4    4    3
 5    5    4
 6    6   NA
 7    7   NA
 8   NA    9
+14
Apr 15 2018-10-15T00:
 df[sapply(df, function(x) all(is.na(x)))] <- NULL 
+5
Apr 13 '17 at 19:53

The accepted answer does not work with non-numeric columns. Taken from this answer, the following works with columns containing different data types:

 Filter(function(x) !all(is.na(x)), df) 
+2
Nov 16 '18 at 3:34

Hope this helps too. It could be done in a single command, but I found it easier to read when split into two commands. I wrapped the following statements in a function, and it ran lightning fast.

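naColsRemoval = function (DataTable) {
  # positions of the columns in which every value is NA
  na.cols = DataTable[, .(which(apply(is.na(.SD), 2, all)))]
  # parenthesised LHS so the positions are evaluated (replaces the deprecated with = F)
  DataTable[, (unlist(na.cols)) := NULL]
}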

.SD lets you limit the check to part of the table if you wish, but it will take the whole table as

+1
Jul 21 '15 at 12:57

Late to the game, but you can also use the janitor package. This function removes all columns that are entirely NA, and can be changed to also remove rows that are entirely NA.

df <- janitor::remove_empty(df, which = "cols")
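
The remark about rows can be sketched as follows, assuming the `which` argument of `remove_empty()` accepts both `"rows"` and `"cols"`, as it does in the janitor versions I have used. The example data frame is made up for illustration.

```r
library(janitor)

# Hypothetical data: row 2 and column z are entirely NA
df <- data.frame(x = c(1, NA, 3), y = c("a", NA, "c"), z = c(NA, NA, NA))

# Drop empty rows and empty columns in one call
cleaned <- remove_empty(df, which = c("rows", "cols"))
dim(cleaned)
```

Passing only `which = "rows"` would leave the all-NA column in place and remove just the empty row.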

0
May 14, '19 at 21:48