Conditionally selecting columns in dplyr where a certain proportion of values is NA

Data

I am working with a dataset similar to the data.frame generated below:

    set.seed(1)
    dta <- data.frame(observation = 1:20,
                      valueA = runif(n = 20),
                      valueB = runif(n = 20),
                      valueC = runif(n = 20),
                      valueD = runif(n = 20))
    dta[2:5, 3] <- NA
    dta[2:10, 4] <- NA
    dta[7:20, 5] <- NA

Some columns contain NA values, with the last column having more than 60% NA observations (14 of 20):

    > sapply(dta, function(x) {table(is.na(x))})
    $observation

    FALSE 
       20 

    $valueA

    FALSE 
       20 

    $valueB

    FALSE  TRUE 
       16     4 

    $valueC

    FALSE  TRUE 
       11     9 

    $valueD

    FALSE  TRUE 
        6    14 
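The same information can be read off more compactly as a fraction per column. As a minimal sketch (the variable name `na_share` is only illustrative): `is.na()` on a data.frame returns a logical matrix, so `colMeans()` gives the proportion of NA values in each column directly.

```r
# Rebuild the example data from above
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# is.na() yields a logical matrix; its column means are the NA fractions
na_share <- colMeans(is.na(dta))
print(na_share)
# valueB: 0.20, valueC: 0.45, valueD: 0.70; the other columns: 0
```

These fractions are what any threshold such as 50% gets compared against.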

Problem

I would like to remove this column within a dplyr pipeline, by passing a condition to select.

Attempts

This is easy to do in base R. For example, to select columns with fewer than 50% NAs, I can do:

 dta[, colSums(is.na(dta)) < nrow(dta) / 2] 

which produces:

    > head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
      observation    valueA    valueB    valueC
    1           1 0.2655087 0.9347052 0.8209463
    2           2 0.3721239        NA        NA

Task

I am interested in achieving the same flexibility within a dplyr pipeline:

    Vectorize(require)(package = c("dplyr",     # Data manipulation
                                   "magrittr"), # Reverse pipe
                       char = TRUE)

    dta %<>%
      # Some transformations I'm doing on the data
      mutate_each(funs(as.numeric)) %>%
      # I want my select to take place here
2 answers

How about this?

    dta %>%
      select(which(colMeans(is.na(.)) < 0.5)) %>%
      head
    #  observation    valueA    valueB    valueC
    #1           1 0.2655087 0.9347052 0.8209463
    #2           2 0.3721239        NA        NA
    #3           3 0.5728534        NA        NA
    #4           4 0.9082078        NA        NA
    #5           5 0.2016819        NA        NA
    #6           6 0.8983897 0.3861141        NA

Updated to use colMeans instead of colSums, which means there is no need to divide by the number of rows.

And, just for the record, in base R you can also use colMeans:

 dta[,colMeans(is.na(dta)) < 0.5] 
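For newer dplyr versions, the same selection can probably be written with a predicate instead of column positions. This is a sketch assuming dplyr 1.0.0 or later, where the `where()` selection helper is available; the variable name `res` is only illustrative.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# where() applies the predicate to each column and keeps
# the columns for which it returns TRUE
# (assumption: dplyr >= 1.0.0)
res <- dta %>%
  select(where(~ mean(is.na(.x)) < 0.5))

names(res)
# "observation" "valueA" "valueB" "valueC"
```

Because the predicate sees one column at a time, it does not need the `.` placeholder for the whole data frame, which makes it easier to drop into the middle of a longer pipeline.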

We can use extract from magrittr after getting a logical vector with summarise_each/unlist:

    library(magrittr)
    library(dplyr)
    dta %>%
      summarise_each(funs(sum(is.na(.)) < n()/2)) %>%
      unlist() %>%
      extract(dta, .)
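Since summarise_each and funs are deprecated in recent dplyr releases, the same idea can be sketched with `across()`. This assumes dplyr 1.0.0 or later; the names `keep` and `res` are only illustrative, and plain `[` indexing stands in for extract.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# One-row summary holding TRUE/FALSE per column, flattened to a
# named logical vector (assumption: dplyr >= 1.0.0 for across())
keep <- dta %>%
  summarise(across(everything(), ~ mean(is.na(.x)) < 0.5)) %>%
  unlist()

# Index the columns of the original data with the logical vector
res <- dta[keep]
names(res)
# "observation" "valueA" "valueB" "valueC"
```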

Or use Filter from base R:

 Filter(function(x) sum(is.na(x)) < length(x)/2, dta) 

Or a slightly more compact option:

 Filter(function(x) mean(is.na(x)) < 0.5, dta) 
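If the threshold needs to vary, the Filter approach can be wrapped in a small function. The sketch below is only an illustration: `keep_cols_by_na` and its `max_na` argument are hypothetical names, not part of base R or dplyr.

```r
# Hypothetical helper (not from any package): keep only the columns
# whose share of NA values is strictly below max_na
keep_cols_by_na <- function(df, max_na = 0.5) {
  stopifnot(is.data.frame(df), max_na >= 0, max_na <= 1)
  Filter(function(x) mean(is.na(x)) < max_na, df)
}

# Rebuild the example data from the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

names(keep_cols_by_na(dta))       # drops valueD (70% NA)
names(keep_cols_by_na(dta, 0.3))  # also drops valueC (45% NA)
```

Because the data frame is the first argument, such a function can also be used as a step in a pipeline, e.g. `dta %>% keep_cols_by_na(0.5)`.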