Impute data with environment by date in R

I need to replace the missing values ​​in the "steps" field with the median "steps" calculated for this particular day (group "by date") with the NA values ​​removed. I already mentioned this thread , but my NA values ​​are not replaced. Can someone help me find out where I'm going wrong? I would prefer to use the base package / data table / plyr. The data set looks ok. like this: -

steps date interval 1: NA 2012-10-01 0 2: NA 2012-10-01 5 3: NA 2012-10-01 10 4: NA 2012-10-01 15 5: NA 2012-10-01 20 --- 17564: NA 2012-11-30 2335 17565: NA 2012-11-30 2340 17566: NA 2012-11-30 2345 17567: NA 2012-11-30 2350 17568: NA 2012-11-30 2355 

The structure and summary of the data set (activity) is given below.

  #str(activity) Classes 'data.table' and 'data.frame': 17568 obs. of 3 variables: $ steps : int NA NA NA NA NA NA NA NA NA NA ... $ date : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ... $ interval: int 0 5 10 15 20 25 30 35 40 45 ... #summary(activity) steps date interval Min. : 0.00 Min. :2012-10-01 Min. : 0.0 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Median : 0.00 Median :2012-10-31 Median :1177.5 Mean : 37.38 Mean :2012-10-31 Mean :1177.5 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Max. :806.00 Max. :2012-11-30 Max. :2355.0 NA :2304 

Things I tried:

Datatable method:

 activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date] summary(activityrepNA) steps date interval Min. : 0.00 Min. :2012-10-01 Min. : 0.0 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Median : 0.00 Median :2012-10-31 Median :1177.5 Mean : 37.38 Mean :2012-10-31 Mean :1177.5 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Max. :806.00 Max. :2012-11-30 Max. :2355.0 NA :2304 

Using ave

 activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)] > summary(activity) steps date interval Min. : 0.00 Min. :2012-10-01 Min. : 0.0 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Median : 0.00 Median :2012-10-31 Median :1177.5 Mean : 37.38 Mean :2012-10-31 Mean :1177.5 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Max. :806.00 Max. :2012-11-30 Max. :2355.0 NA :2304 

Attempting to execute ddply

 cleandatapls<-ddply(activity, + .(as.character(date)), + transform, + steps=ifelse(is.na(steps), median(steps, na.rm=TRUE), steps)) > summary(cleandatapls) as.character(date) steps date interval Length:17568 Min. : 0.00 Min. :2012-10-01 Min. : 0.0 Class :character 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Mode :character Median : 0.00 Median :2012-10-31 Median :1177.5 Mean : 37.38 Mean :2012-10-31 Mean :1177.5 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Max. :806.00 Max. :2012-11-30 Max. :2355.0 NA :2304 

The unit for calculating the median

 whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE) > summary(whynoclean) Group.1 steps date interval Min. :2012-10-01 Min. :0 Min. :2012-10-01 Min. :1178 1st Qu.:2012-10-16 1st Qu.:0 1st Qu.:2012-10-16 1st Qu.:1178 Median :2012-10-31 Median :0 Median :2012-10-31 Median :1178 Mean :2012-10-31 Mean :0 Mean :2012-10-31 Mean :1178 3rd Qu.:2012-11-15 3rd Qu.:0 3rd Qu.:2012-11-15 3rd Qu.:1178 Max. :2012-11-30 Max. :0 Max. :2012-11-30 Max. :1178 NA :8 

EDIT output for code using mutate

 activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T))) Source: local data table [17,568 x 3] steps date interval 1 NA 2012-10-01 0 2 NA 2012-10-01 5 3 NA 2012-10-01 10 4 NA 2012-10-01 15 5 NA 2012-10-01 20 6 NA 2012-10-01 25 7 NA 2012-10-01 30 8 NA 2012-10-01 35 9 NA 2012-10-01 40 10 NA 2012-10-01 45 .. ... ... ... 

UPDATE:

Stephen Beaupre helped me understand that my approach to imputation was wrong, as there were specific dates with only NA values ​​that cause the problem, since the median NA is NA. Used a different approach.

+7
r median plyr missing-data
source share
1 answer

Try:

 library(dplyr) df %>% group_by(date) %>% mutate(steps = ifelse(is.na(steps), median(steps, na.rm = T), steps)) 

If for a given date all steps are NA s, you can replace them with 0:

 df %>% group_by(date) %>% mutate(steps = ifelse(all(is.na(steps)), 0, ifelse(is.na(steps), median(steps, na.rm = T), steps))) 
+4
source share

All Articles