Data
I am working with a dataset similar to the data.frame generated below:
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA
Some columns contain NA values, with the last column having more than 60% NA observations.
> sapply(dta, function(x) {table(is.na(x))})
$observation

FALSE 
   20 

$valueA

FALSE 
   20 

$valueB

FALSE  TRUE 
   16     4 

$valueC

FALSE  TRUE 
   11     9 

$valueD

FALSE  TRUE 
    6    14 
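The same information is easier to read as proportions. A minimal sketch using only base R (the name na_share is mine, chosen for illustration; the data is recreated so the snippet runs on its own):

```r
# Recreate the example data from above
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# is.na() gives a logical matrix; the column means are the NA shares
na_share <- colMeans(is.na(dta))
na_share
# valueB: 0.20, valueC: 0.45, valueD: 0.70; observation and valueA: 0
```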
Problem
I would like to remove this column within a dplyr pipeline by somehow passing the condition to select.
Attempts
This is easy to do in base R. For example, to select columns with less than 50% NAs, I can do:
dta[, colSums(is.na(dta)) < nrow(dta) / 2]
which produces:
> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
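To show the kind of flexibility I mean, the base approach can be wrapped so the threshold becomes a parameter. This is only a sketch: keep_cols_by_na and max_na_share are hypothetical names of my own, not functions from any package.

```r
# Recreate the example data so the snippet runs on its own
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# Hypothetical helper: keep columns whose NA share is below max_na_share
keep_cols_by_na <- function(df, max_na_share = 0.5) {
  # drop = FALSE keeps a data.frame even if only one column survives
  df[, colMeans(is.na(df)) < max_na_share, drop = FALSE]
}

names(keep_cols_by_na(dta))        # drops valueD (70% NA)
names(keep_cols_by_na(dta, 0.25))  # also drops valueC (45% NA)
```

With the default threshold this selects the same columns as the colSums expression above, since colMeans(is.na(df)) < 0.5 is equivalent to colSums(is.na(df)) < nrow(df) / 2.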
Task
I am interested in achieving the same flexibility in a dplyr pipeline:
Vectorize(require)(package = c("dplyr",