How to remove rows from a data frame that contain n * NA

Question

How to remove rows from a data frame that contain n * NA

I have several large datasets with ~10 columns and ~200000 rows. Not every column contains a value for each row, although at least one column must contain a value for the row to be present. I would like to set a threshold for the number of NA values allowed in a row.

My Dataframe looks something like this:

    ID  q  r  s  t  u  v  w  x  y  z
    A   1  5 NA  3  8  9 NA  8  6  4
    B   5 NA  4  6  1  9  7  4  9  3
    C  NA  9  4 NA  4  8  4 NA  5 NA
    D   2  2  6  8  4 NA  3  7  1 32

I would like to delete rows in which more than two cells contain NA, to get:

    ID  q  r  s  t  u  v  w  x  y  z
    A   1  5 NA  3  8  9 NA  8  6  4
    B   5 NA  4  6  1  9  7  4  9  3
    D   2  2  6  8  4 NA  3  7  1 32

complete.cases deletes all rows containing any NA, and I know that it is possible to delete rows that contain NA in certain columns. Is there a way to make it non-specific as to which columns contain NA, and instead depend on how many of the total are NA?

Alternatively, this data frame is generated by merging multiple data frames using

    file1 <- read.delim("~/file1.txt")
    file2 <- read.delim(file = args[1])
    file1 <- merge(file1, file2, by = "chr.pos", all = TRUE)

Perhaps the merge function could be changed?

thanks


merge filter r rows na

user2662708 Aug 08 '13 at 1:03


4 answers

Hugh · Answer 1 · 2013-08-08T01:25:05+0000

Use rowSums. To remove rows from a data frame (df) that contain exactly n NA values:

 df <- df[rowSums(is.na(df)) != n, ]

or delete rows containing n or more NA values:

 df <- df[rowSums(is.na(df)) < n, ]

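In both cases, of course, replace n with the number required. For the threshold in the question (delete rows with more than two NA values), use n = 3 in the second form.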
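As a rough check of the rowSums approach against the example in the question, here is a minimal sketch. The helper name drop_na_rows and its max_na argument are made up for illustration and are not part of the answer; the data frame is the question's example rebuilt by hand.

```r
# Hypothetical helper (an assumption, not from the original answer):
# keep only the rows whose NA count is at most max_na.
drop_na_rows <- function(df, max_na) {
  df[rowSums(is.na(df)) <= max_na, , drop = FALSE]
}

# The example data from the question, rebuilt by hand.
df <- data.frame(
  ID = c("A", "B", "C", "D"),
  q = c(1, 5, NA, 2),
  r = c(5, NA, 9, 2),
  s = c(NA, 4, 4, 6),
  t = c(3, 6, NA, 8),
  u = c(8, 1, 4, 4),
  v = c(9, 9, 8, NA),
  w = c(NA, 7, 4, 3),
  x = c(8, 4, NA, 7),
  y = c(6, 9, 5, 1),
  z = c(4, 3, NA, 32)
)

# Rows A, B and D have at most two NA values; row C has four and is dropped.
result <- drop_na_rows(df, max_na = 2)
print(result)
```

Writing the condition as "at most max_na" is the same as `rowSums(is.na(df)) < n` with n = max_na + 1; which form reads better is a matter of taste.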

Ricardo saporta · Answer 2 · 2013-08-08T01:28:27+0000

If dat is the name of your data.frame, then the following returns what you are looking for (rows with at most two NA values are kept):

    keep <- rowSums(is.na(dat)) <= 2
    dat <- dat[keep, ]

What does it do:

    is.na(dat)         # returns a matrix of T/F
                       # note that when adding logicals
                       # T == 1, and F == 0
    rowSums(.)         # quickly computes the total per row
                       # since your task is to identify the
                       # rows with a certain number of NA
    rowSums(.) <= 2    # for each row, determine if the sum
                       # (which is the number of NAs) is at most
                       # 2 or not. Returns T/F accordingly

We use the output of this last statement to determine which rows to keep. Note that there is no need to actually store this last logical vector; it can be used directly as the row index.
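To make the intermediate values visible, the steps can be run one at a time on a tiny data frame. This is only an illustrative sketch: the toy data and the threshold of one NA per row are assumptions chosen to keep the output small, not values from the question.

```r
# Toy data (hypothetical, chosen only to show the intermediate values).
dat <- data.frame(
  a = c(1, NA, NA),
  b = c(NA, 2, NA),
  c = c(3, 4, NA)
)

# Step 1: logical matrix with TRUE wherever a cell is NA.
na_matrix <- is.na(dat)

# Step 2: TRUE counts as 1, so the row sums are the NA counts per row.
na_count <- rowSums(na_matrix)

# Step 3: logical vector marking the rows to keep (at most one NA here).
keep <- na_count <= 1

# Step 4: index the rows of the data frame with that logical vector.
filtered <- dat[keep, ]

print(na_count)
print(filtered)
```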

Blue magister · Answer 3 · 2013-08-08T01:25:08+0000

If d is your data frame, try the following, which keeps the rows with at most two NA values:

    d <- d[rowSums(is.na(d)) <= 2, ]

42- · Answer 4 · 2013-08-08T01:24:15+0000

This will return a dataset with no more than two NA values in each row:

    dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ) , ]
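The apply form and the rowSums form used in the other answers should select the same rows. A small sketch to check this, assuming a made-up, numeric-only data frame (the column names and values are hypothetical):

```r
# Toy data (hypothetical): rows have 1, 3 and 1 NA values respectively.
dfrm <- data.frame(
  x = c(1, NA, NA),
  y = c(NA, NA, 5),
  z = c(7, NA, 9)
)

# Row-wise count with apply: r is one row of the data, passed as a vector.
keep_apply <- apply(dfrm, 1, function(r) sum(is.na(r)) <= 2)

# The same test written with rowSums on the logical NA matrix.
keep_rowsums <- rowSums(is.na(dfrm)) <= 2

# Both should mark rows 1 and 3 as kept and row 2 as dropped.
same_selection <- identical(unname(keep_apply), unname(keep_rowsums))
print(same_selection)

kept <- dfrm[keep_apply, ]
print(kept)
```

On large data frames rowSums is usually the faster of the two, since apply calls the function once per row.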
