Setup
I have a column of durations stored as strings in a data frame. I want to convert them to an appropriate time object, possibly POSIXlt. Most strings are easy to parse using this method:
> data <- data.frame(time.string = c(
+   "1 d 2 h 3 m 4 s",
+   "10 d 20 h 30 m 40 s",
+   "--"))
> data$time.span <- strptime(data$time.string, "%j d %H h %M m %S s")
> data$time.span
[1] "2012-01-01 02:03:04" "2012-01-10 20:30:40" NA
Missing durations are encoded as "--" and must be converted to NA. This already happens, and that behavior must be preserved.
The problem is that the strings omit zero-valued elements. For example, the desired value for the string "1 d 2 h 14 s" is 2012-01-01 02:00:14. However, this string parses to NA with the simple format:
> data2 <- data.frame(time.string = c(
+   "1 d 2 h 14 s",
+   "10 d 20 h 30 m 40 s",
+   "--"))
> data2$time.span <- strptime(data2$time.string, "%j d %H h %M m %S s")
> data2$time.span
[1] NA "2012-01-10 20:30:40" NA
Questions
- What is the "R Way" to handle all possible string formats? Perhaps test and retrieve each element separately and then recombine?
- Is POSIXlt the correct target class? I need a duration free from any specific start time, so the spurious year and month data (2012-01-) are worrying.
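One way to read the "extract each element separately and recombine" idea is sketched below. It is only an illustration, not taken from any answer: the helper names get.component and parse.duration are made up here, and it assumes that "d" counts whole elapsed days (so "1 d" adds 24 hours). Returning a difftime in seconds avoids attaching any year or month to the result.

```r
# Hypothetical helper: return the number in front of a unit letter, or 0 if the unit is absent.
get.component <- function(x, unit) {
  pattern <- paste0("([0-9]+) *", unit)
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, captured number) or character(0) when there is no match.
  vapply(hits, function(hit) if (length(hit) == 2) as.numeric(hit[2]) else 0, numeric(1))
}

# Hypothetical parser: sum the components in seconds and return a difftime.
parse.duration <- function(x) {
  x <- as.character(x)
  secs <- get.component(x, "d") * 86400 +
          get.component(x, "h") * 3600 +
          get.component(x, "m") * 60 +
          get.component(x, "s")
  # Strings with no recognizable component (such as "--") become NA.
  secs[!grepl("[0-9]+ *[dhms]", x)] <- NA
  as.difftime(secs, units = "secs")
}

durations <- parse.duration(c("1 d 2 h 14 s", "10 d 20 h 30 m 40 s", "--"))
as.numeric(durations)  # 93614 937840 NA
```

Because missing units simply contribute 0, the omitted "m" in "1 d 2 h 14 s" needs no special handling, and the function works on a whole column at once.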
Solution
@mplourde definitely had the right idea in dynamically building a format string by testing which components are present in the input. Using cut(Sys.Date(), breaks='years') as the baseline for difftime was also good, but it did not account for a critical quirk in as.POSIXct(). Note: I am using base R 2.11; this may have been fixed in later versions.
The result of as.POSIXct() changes dramatically depending on whether the date component is included:
> x <- "1 d 1 h 14 m 1 s"
> y <- "1 h 14 m 1 s"  # Same string, no date component
> fmt.x <- "%j d %H h %M m %S s"  # format strings built as in parse.time below
> fmt.y <- "%H h %M m %S s"
> as.POSIXct(x, format=fmt.x)  # Including the date: baseline is the start of the year
[1] "2012-01-01 01:14:01 EST"
> as.POSIXct(y, format=fmt.y)  # Excluding the date: baseline is the start of today
[1] "2012-06-26 01:14:01 EDT"
So the second argument to the difftime function should be:
- The beginning of the first day of the current year, if the input string contains a day component
- The beginning of the current day, if the input string has no day component
This can be done by switching the breaks argument of the cut function:
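parse.time <- function(x) {
  x <- as.character(x)  # expects a single string
  # Baseline unit for cut(): start of the year if a day component is present, else start of today
  break.unit <- ifelse(grepl("d", x), "years", "days")
  # Build the format string from only the components present in x
  format <- paste(c(if (grepl("d", x)) "%j d",
                    if (grepl("h", x)) "%H h",
                    if (grepl("m", x)) "%M m",
                    if (grepl("s", x)) "%S s"), collapse=" ")
  if (nchar(format) > 0) {
    difftime(as.POSIXct(x, format=format),
             cut(Sys.Date(), breaks=break.unit),
             units="hours")
  } else {
    NA  # no recognizable component, e.g. "--"
  }
}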