How can I conditionally remove observations from a data frame in R without losing the NA values?

There is a variable called YOB in my data frame. As you can see, it has 333 NA values.

 > summary(train$YOB)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    1880    1970    1983    1980    1993    2039     333

I have identified some outliers and want to get rid of them. Anything less than 1900 or greater than 2003 should be removed. I tried to do this by indexing.

 train = train[which(train$YOB >= 1900 & train$YOB <= 2003),] 

Unfortunately, the observations whose YOB is NA are deleted as well.

 > summary(train$YOB)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1900    1970    1983    1980    1993    2003

I run into the same problem when using the subset command.

 > train = subset(train, YOB >= 1900 & YOB <= 2003)
 > summary(train$YOB)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1900    1970    1983    1980    1993    2003

I also tried adding an is.na condition to both attempts, but without success. For example:

 > train = train[which(!is.na(train$YOB) & train$YOB >= 1900 & train$YOB <= 2003),]
 > summary(train$YOB)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1900    1970    1983    1980    1993    2003

I would like to keep the observations that have NA in the YOB variable and remove only the numeric ones that fall outside the range. The idea is to impute the missing values in a second step.

1 answer

which returns a numeric index of the positions where the condition is TRUE, so every row where the condition evaluates to NA is skipped. Using the logical index directly, without wrapping it in which, keeps those positions, but the index itself is NA there, so the whole row comes back as NA, even though the other columns of that row hold non-NA values.

 res1 <- train[train$YOB >= 1900 & train$YOB <= 2003,]
 res1[is.na(res1$YOB),]
 #   YOB col2
 #NA  NA   NA

The correct way is to add another condition with is.na, so that NA rows are selected explicitly:

 res2 <- train[is.na(train$YOB) | (train$YOB >= 1900 & train$YOB <= 2003),]
 res2[is.na(res2$YOB),]
 #   YOB      col2
 #42  NA 0.2258094
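The same is.na condition should also work with subset, which the question tried. A minimal sketch, assuming a small hypothetical data frame called toy (made up for illustration, not the OP's train data):

```r
# Hypothetical toy data: two NA values and two out-of-range years
toy <- data.frame(YOB  = c(NA, 1880, 1950, 1990, 2039, NA),
                  col2 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

# subset() drops rows where the condition is NA, so NA has to be
# allowed explicitly with is.na()
kept <- subset(toy, is.na(YOB) | (YOB >= 1900 & YOB <= 2003))

kept$YOB
# [1]   NA 1950 1990   NA

sum(is.na(kept$YOB))  # both NA rows are still present
# [1] 2
```

Only the rows with 1880 and 2039 are removed; the other columns of the NA rows keep their original values.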

Using a simple example:

 set.seed(25)
 d1 <- data.frame(v1 = c(NA, 1, 5), v2 = rnorm(3))
 d1$v1 > 1
 #[1]    NA FALSE  TRUE

Here the comparison with NA stays NA. If we use which

 which(d1$v1 > 1)
 #[1] 3

we get only the index of the TRUE values. The OP wants both the NA rows and the rows that satisfy the logical condition to be returned. In that case:

 d1[is.na(d1$v1) | d1$v1 > 1,]
 #  v1         v2
 #1 NA -0.2118336
 #3  5 -1.1533076
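The reason this works is that TRUE | NA evaluates to TRUE in R, so the combined index contains no NA at all. As a rough sketch, the condition can be wrapped in a small helper; in_range_or_na and the sample values below are hypothetical names and data chosen for illustration:

```r
# Hypothetical helper: TRUE where x is NA or lies within [lo, hi]
in_range_or_na <- function(x, lo, hi) {
  is.na(x) | (x >= lo & x <= hi)
}

yob  <- c(NA, 1880, 1950, 2039)
keep <- in_range_or_na(yob, 1900, 2003)
keep
# [1]  TRUE FALSE  TRUE FALSE

# The index has no NA, so no all-NA rows are produced when subsetting
d <- data.frame(YOB = yob, col2 = 1:4)
d[keep, ]
```

Because the helper never returns NA, it can be used directly as a logical row index without which.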

data

 set.seed(29)
 train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace=TRUE),
                     col2 = rnorm(100))