Collapsing rows where some are all NA and others are disjoint with some NAs

I have a simple data frame like this:

    ID Col1 Col2 Col3 Col4
     1   NA   NA   NA   NA
     1    5   10   NA   NA
     1   NA   NA   15   20
     2   NA   NA   NA   NA
     2   25   30   NA   NA
     2   NA   NA   35   40

And I would like to reformat it as such:

    ID Col1 Col2 Col3 Col4
     1    5   10   15   20
     2   25   30   35   40

(Note: the real data set contains thousands of rows, and the values are from biological data; the NAs follow no simple pattern, except that they never overlap, and yes, there are exactly 3 rows for each ID.)

STEP ONE: Get rid of the rows that have only NA values.

At first glance, it looked simple, but I ran into some problems.

complete.cases(DF) returns all FALSE, so I cannot use it to remove the all-NA rows, as in DF[complete.cases(DF),]. This is because every row contains at least one NA.

Other schemes based on a row-wise is.na test fall out for the same reason.

STEP TWO: Collapse the remaining two rows into one.

I am thinking of using something like aggregate to do this, but there must be an easier way than my attempt, which doesn't work at all.

Thanks for any advice.

+9
r aggregate na
4 answers

Try

    library(dplyr)
    DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))

Edit: to handle the case where a column is all NA for a given ID, we need a sum_NA() function that returns NA if all values are NA:

    txt <- "ID Col1 Col2 Col3 Col4
    1 NA NA NA NA
    1 5 10 NA NA
    1 NA NA 15 20
    2 NA NA NA NA
    2 NA 30 NA NA
    2 NA NA 35 40"
    DF <- read.table(text = txt, header = TRUE)

    # original code
    DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))
    # 'summarise_each()' is deprecated.
    # Use 'summarise_all()', 'summarise_at()' or 'summarise_if()' instead.
    # To map 'funs' over all variables, use 'summarise_all()'
    # A tibble: 2 x 5
    #      ID  Col1  Col2  Col3  Col4
    #   <int> <int> <int> <int> <int>
    # 1     1     5    10    15    20
    # 2     2     0    30    35    40

    sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

    DF %>% group_by(ID) %>% summarise_all(funs(sum_NA))
    DF %>% group_by(ID) %>% summarise_if(is.numeric, funs(sum_NA))
    # A tibble: 2 x 5
    #      ID  Col1  Col2  Col3  Col4
    #   <int> <int> <int> <int> <int>
    # 1     1     5    10    15    20
    # 2     2    NA    30    35    40
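Since funs() and the summarise_*() variants are themselves deprecated in newer dplyr releases, the same idea can probably be written with across(). This is only a sketch, assuming dplyr 1.0.0 or later is installed; sum_NA() is the helper from this answer, and the name res is just for illustration.

```r
library(dplyr)

# Same example data as above: Col1 is all NA for ID 2
txt <- "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)

# Returns a typed NA when every value is NA, otherwise the sum of the non-NA values
sum_NA <- function(x) {
  if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)
}

# across(everything(), ...) applies sum_NA to every non-grouping column
res <- DF %>%
  group_by(ID) %>%
  summarise(across(everything(), sum_NA))

res
```

Note that summing only gives the right answer because each ID has at most one non-NA value per column, as stated in the question.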
+8

Here's a data.table approach that applies na.omit() to each column within each ID group.

    library(data.table)
    setDT(df)[, lapply(.SD, na.omit), by = ID]
    #    ID Col1 Col2 Col3 Col4
    # 1:  1    5   10   15   20
    # 2:  2   25   30   35   40
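The na.omit() call relies on every column having exactly one non-NA value per ID. If a column could be all NA within a group, a more defensive variant might look like the sketch below. The helper first_non_na() is a hypothetical name, not part of data.table, and the example data is the variant where Col1 is all NA for ID 2.

```r
library(data.table)

df <- read.table(text = "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40", header = TRUE)

# Hypothetical helper: first non-NA value, or a typed NA if there is none
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0L) keep[1L] else x[NA_integer_]
}

# One row per ID, each column reduced to a single value
res <- as.data.table(df)[, lapply(.SD, first_non_na), by = ID]
res
```

Because the helper always returns a length-one value, every group yields exactly one row regardless of how many NAs it contains.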
+14

Here are some aggregate attempts:

    aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
    #  ID Col1 Col2 Col3 Col4
    #1  1    5   10   15   20
    #2  2   25   30   35   40

The formula interface of aggregate applies na.omit to the whole data set by default, before any grouping, so it drops every row of dat, since each contains at least one NA. Try it: nrow(na.omit(dat)) returns 0. So in this case, pass na.pass as the na.action to aggregate and use na.omit as FUN to drop the NAs that were passed through.

Alternatively, skip the formula interface and specify the columns to aggregate manually:
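To make the point above concrete, here is a small base R sketch using the data from the question. The name dat follows this answer; res is only for illustration.

```r
dat <- read.table(text = "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 25 30 NA NA
2 NA NA 35 40", header = TRUE)

# The default na.action would remove every row, because each has at least one NA
n_complete <- nrow(na.omit(dat))
n_complete

# Let the NAs through to the grouping step, then drop them per group and column
res <- aggregate(. ~ ID, data = dat, FUN = na.omit, na.action = na.pass)
res
```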

    aggregate(dat[-1], dat[1], FUN=na.omit)
    aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
    #  ID Col1 Col2 Col3 Col4
    #1  1    5   10   15   20
    #2  2   25   30   35   40
+6

simple way:

    as.data.frame(lapply(myData[,c('Col1','Col2','Col3','Col4')], function(x) x[!is.na(x)]))

But if the columns do not all have the same number of non-NA values, you need to truncate them like this:

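    # keep only the non-NA values of each column
    temp <- lapply(myData[,c('Col1','Col2','Col3','Col4')], function(x) x[!is.na(x)])
    # shortest column length; seq_len() is safe even when it is 0
    len <- min(sapply(temp, length))
    as.data.frame(lapply(temp, `[`, seq_len(len)))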
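Note that this approach drops the ID column. A sketch of one way to put it back is shown below. It assumes that the rows are sorted by ID and that every ID has exactly one non-NA value in each column, as in the question; value_cols and collapsed are illustrative names.

```r
myData <- read.table(text = "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 25 30 NA NA
2 NA NA 35 40", header = TRUE)

value_cols <- c("Col1", "Col2", "Col3", "Col4")

# Strip the NAs from each value column independently
collapsed <- as.data.frame(lapply(myData[value_cols], function(x) x[!is.na(x)]))

# Assumption: rows are ordered by ID, so the i-th remaining value belongs to the i-th ID
collapsed <- cbind(ID = unique(myData$ID), collapsed)
collapsed
```

If either assumption fails, the values can end up attached to the wrong ID, so a grouped approach is the safer choice for real data.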
+1
