R time components from semi-standard strings

Setup

I have a column of durations stored as strings in a data frame. I want to convert them to an appropriate time object, possibly POSIXlt. Most strings are easy to parse using this method:

    > data <- data.frame(time.string = c(
    +   "1 d 2 h 3 m 4 s",
    +   "10 d 20 h 30 m 40 s",
    +   "--"))
    > data$time.span <- strptime(data$time.string, "%j d %H h %M m %S s")
    > data$time.span
    [1] "2012-01-01 02:03:04" "2012-01-10 20:30:40" NA

Missing durations are encoded as "--" and must be converted to NA. This already happens, but the behavior must be preserved.

The problem is that the strings omit zero-valued elements. For example, the desired value for the string "1 d 2 h 14 s" is 2012-01-01 02:00:14. However, this string parses to NA with the simple format:

    > data2 <- data.frame(time.string = c(
    +   "1 d 2 h 14 s",
    +   "10 d 20 h 30 m 40 s",
    +   "--"))
    > data2$time.span <- strptime(data2$time.string, "%j d %H h %M m %S s")
    > data2$time.span
    [1] NA "2012-01-10 20:30:40" NA

Questions

  • What is the "R Way" to handle all possible string formats? Perhaps test and retrieve each element separately and then recombine?
  • Is POSIXlt the correct target class? I need a duration free from any specific start time, so the spurious year and month ( 2012-01- ) are worrying.

Solution

@mplourde definitely had the right idea: build the format string dynamically by testing which components the input contains. Using cut(Sys.Date(), breaks='years') as the baseline for difftime was also good, but it did not account for a critical quirk in as.POSIXct(). Note: I am using R 2.11; this may have been fixed in later versions.

The result of as.POSIXct() changes dramatically depending on whether the format includes a day component:

    > x <- "1 d 1 h 14 m 1 s"
    > y <- "1 h 14 m 1 s"   # Same string, no day component
    > format (x)            # as specified below
    [1] "%j d %H h %M m %S s"
    > format (y)
    [1] "%H h %M m %S s"
    > as.POSIXct(x,format=format)   # Including the day: baseline is the start of the year
    [1] "2012-01-01 01:14:01 EST"
    > as.POSIXct(y,format=format)   # Excluding the day: baseline is the start of today
    [1] "2012-06-26 01:14:01 EDT"

So the second argument to the difftime function should be:

  • The beginning of the first day of the current year if the input string contains a day component
  • The beginning of the current day if the input string does not have a day component

This can be done by changing the breaks argument in the cut function:

    parse.time <- function (x) {
      x <- as.character (x)
      # Baseline unit for cut(): start of year if a day component is present, else start of today
      break.unit <- ifelse(grepl("d",x),"years","days")
      # Build the format from only those components that appear in the string
      format <- paste(c(if (grepl("d", x)) "%j d",
                        if (grepl("h", x)) "%H h",
                        if (grepl("m", x)) "%M m",
                        if (grepl("s", x)) "%S s"), collapse=" ")

      if (nchar(format) > 0) {
        difftime(as.POSIXct(x, format=format),
                 cut(Sys.Date(), breaks=break.unit),
                 units="hours")
      } else {NA}
    }
2 answers

difftime objects are time duration objects that can be added to POSIXct or POSIXlt objects. Maybe you want to use these instead of POSIXlt?
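To illustrate that point, here is a minimal base R sketch. The start time and the duration are made-up values chosen only for this example; they do not come from the question's data.

```r
# A duration of 2 hours 14 seconds, stored with no calendar date attached.
span <- as.difftime(2 * 3600 + 14, units = "secs")

# An arbitrary start time (illustrative value), fixed to UTC so that
# daylight saving time does not affect the arithmetic.
start <- as.POSIXct("2012-06-26 00:00:00", tz = "UTC")

# Adding a difftime to a POSIXct shifts it by that duration.
end <- start + span

# The same duration can be reported in other units.
span.mins <- as.numeric(span, units = "mins")
```

Because the duration carries no year or month, the spurious 2012-01- prefix from strptime never appears.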

Regarding the conversion from strings to time objects, you can do something like this:

    data <- data.frame(time.string = c(
      "1 d 1 h",
      "30 m 10 s",
      "1 d 2 h 3 m 4 s",
      "2 h 3 m 4 s",
      "10 d 20 h 30 m 40 s",
      "--"))

    f <- function(x) {
      x <- as.character(x)
      format <- paste(c(if (grepl('d', x)) '%j d',
                        if (grepl('h', x)) '%H h',
                        if (grepl('m', x)) '%M m',
                        if (grepl('s', x)) '%S s'), collapse=' ')

      if (nchar(format) > 0) {
        if (grepl('%j d', format)) {
          # '%j 1' is day 0. We add a day so that x = '1 d' means 24hrs.
          difftime(as.POSIXct(x, format=format) + as.difftime(1, units='days'),
                   cut(Sys.Date(), breaks='years'),
                   units='hours')
        } else {
          as.difftime(x, format, units='hours')
        }
      } else {
        NA
      }
    }

    data$time.span <- sapply(data$time.string, FUN=f)
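A different route, closer to the "retrieve each element separately and then recombine" idea from the question, is to skip calendar parsing altogether. The sketch below is not the method used above: parse_duration is a hypothetical helper name, it assumes every component looks like "<digits> <unit letter>" with the letters d, h, m, s, and it treats "1 d" as 24 hours.

```r
# Hypothetical helper: sum the d/h/m/s components of each string in seconds.
parse_duration <- function(x) {
  x <- as.character(x)
  unit.secs <- c(d = 86400, h = 3600, m = 60, s = 1)  # seconds per unit
  total <- numeric(length(x))
  found <- logical(length(x))
  for (u in names(unit.secs)) {
    # Capture the digits that precede this unit letter, e.g. "14 s".
    pattern <- paste0("([0-9]+) *", u, "\\b")
    groups <- regmatches(x, regexec(pattern, x))
    value <- vapply(groups,
                    function(g) if (length(g) == 2) as.numeric(g[2]) else NA_real_,
                    numeric(1))
    hit <- !is.na(value)
    total[hit] <- total[hit] + value[hit] * unit.secs[[u]]
    found <- found | hit
  }
  total[!found] <- NA  # strings with no component at all, such as "--"
  as.difftime(total, units = "secs")
}

durations <- parse_duration(c("1 d 2 h 14 s", "10 d 20 h 30 m 40 s", "--"))
```

Omitted components simply contribute zero, so no format string has to be built per input, and the function is vectorized over the whole column. Setting units(durations) <- "hours" afterwards rescales the result if hours are preferred.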

I think you're in luck with lubridate:

From "Dates and Times Made Easy with lubridate":

5.3. Duration

...

Durations are invariant to leap years, leap seconds, and daylight saving time because they are measured in seconds. Therefore, durations have consistent lengths and can easily be compared with other durations. Durations are the appropriate objects to use when comparing time-based attributes such as speeds, rates, and lifetimes. lubridate uses the difftime class from base R for durations. Additional difftime methods have been created to facilitate this.


...

Duration objects can be easily created with the helper functions dyears(), dweeks(), ddays(), dhours(), dminutes(), and dseconds(). The d in the name stands for duration and distinguishes these objects from period objects, which are discussed in Section 5.4. Each object creates a duration in seconds using the estimated relationships given above.

However, I have not yet found a function for parsing a string into a duration.
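Once the components have been extracted by some other means, the helpers quoted above can assemble them. This is a small sketch assuming the lubridate package is installed; the values 1, 2, and 14 are typed in by hand to mirror the string "1 d 2 h 14 s" and are not parsed from it.

```r
library(lubridate)

# Build a duration of 1 day, 2 hours and 14 seconds from its components.
d <- ddays(1) + dhours(2) + dseconds(14)

# Durations are stored in seconds; convert on request.
total.secs  <- as.numeric(d)            # length in seconds
total.hours <- as.numeric(d, "hours")   # the same length expressed in hours
```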


You can also take a look at Ruby's Chronic to see how elegant time parsing can be. I did not find such a library for R.
