I need to replace the missing values in the "steps" field with the median "steps" calculated for this particular day (group "by date") with the NA values removed. I already mentioned this thread , but my NA values are not replaced. Can someone help me find out where I'm going wrong? I would prefer to use the base package / data table / plyr. The data set looks ok. like this: -
steps date interval 1: NA 2012-10-01 0 2: NA 2012-10-01 5 3: NA 2012-10-01 10 4: NA 2012-10-01 15 5: NA 2012-10-01 20 --- 17564: NA 2012-11-30 2335 17565: NA 2012-11-30 2340 17566: NA 2012-11-30 2345 17567: NA 2012-11-30 2350 17568: NA 2012-11-30 2355
The structure and summary of the data set (activity) is given below.
#str(activity) Classes 'data.table' and 'data.frame': 17568 obs. of 3 variables: $ steps : int NA NA NA NA NA NA NA NA NA NA ... $ date : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ... $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Things I tried:
Datatable method:
activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date] summary(activityrepNA) steps date interval Min. : 0.00 Min. :2012-10-01 Min. : 0.0 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Median : 0.00 Median :2012-10-31 Median :1177.5 Mean : 37.38 Mean :2012-10-31 Mean :1177.5 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Max. :806.00 Max. :2012-11-30 Max. :2355.0 NA :2304
Using ave
activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)] > summary(activity) steps date interval Min. : 0.00 Min. :2012-10-01 Min. : 0.0 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Median : 0.00 Median :2012-10-31 Median :1177.5 Mean : 37.38 Mean :2012-10-31 Mean :1177.5 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Max. :806.00 Max. :2012-11-30 Max. :2355.0 NA :2304
Attempting to execute ddply
cleandatapls<-ddply(activity, + .(as.character(date)), + transform, + steps=ifelse(is.na(steps), median(steps, na.rm=TRUE), steps)) > summary(cleandatapls) as.character(date) steps date interval Length:17568 Min. : 0.00 Min. :2012-10-01 Min. : 0.0 Class :character 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Mode :character Median : 0.00 Median :2012-10-31 Median :1177.5 Mean : 37.38 Mean :2012-10-31 Mean :1177.5 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Max. :806.00 Max. :2012-11-30 Max. :2355.0 NA :2304
The unit for calculating the median
whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE) > summary(whynoclean) Group.1 steps date interval Min. :2012-10-01 Min. :0 Min. :2012-10-01 Min. :1178 1st Qu.:2012-10-16 1st Qu.:0 1st Qu.:2012-10-16 1st Qu.:1178 Median :2012-10-31 Median :0 Median :2012-10-31 Median :1178 Mean :2012-10-31 Mean :0 Mean :2012-10-31 Mean :1178 3rd Qu.:2012-11-15 3rd Qu.:0 3rd Qu.:2012-11-15 3rd Qu.:1178 Max. :2012-11-30 Max. :0 Max. :2012-11-30 Max. :1178 NA :8
EDIT output for code using mutate
activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T))) Source: local data table [17,568 x 3] steps date interval 1 NA 2012-10-01 0 2 NA 2012-10-01 5 3 NA 2012-10-01 10 4 NA 2012-10-01 15 5 NA 2012-10-01 20 6 NA 2012-10-01 25 7 NA 2012-10-01 30 8 NA 2012-10-01 35 9 NA 2012-10-01 40 10 NA 2012-10-01 45 .. ... ... ...
UPDATE:
Stephen Beaupre helped me understand that my approach to imputation was wrong, as there were specific dates with only NA values that cause the problem, since the median NA is NA. Used a different approach.