Parsing dmy dates together with dmY using parse_date_time

Question

I have a vector of character representations of dates, where the formats are mainly dmY (e.g., 27-09-2013) and dmy (e.g., 27-09-13), with occasional b or B months. Thus, parse_date_time in the lubridate package, which "allows the user to specify several format-orders to handle heterogeneous date-time character representations", could be a very useful function for me.

However, parse_date_time seems to have trouble parsing dmy dates when they occur together with dmY dates. When parsing dmy alone, or dmy together with some other formats relevant to me, it works fine. This pattern was also noted in a comment on @Peyton's answer here. A quick fix was suggested, but I would like to ask whether it can be handled within lubridate.

Here I show some examples where I try to parse dmy dates together with some other formats, specifying orders accordingly.

    library(lubridate)
    # version: lubridate_1.3.0

    # regarding how the date format is specified in 'orders':
    # examples in ?parse_date_time
    # parse_date_time(x, "ymd")
    # parse_date_time(x, "%y%m%d")
    # parse_date_time(x, "%y %m %d")
    # these order strings are equivalent and parse the same way
    # "Formatting orders might include arbitrary separators. These are discarded"

    # dmy date only
    parse_date_time(x = "27-09-13", orders = "dmy")
    # [1] "2013-09-27 UTC"
    # OK

    # dmy & dBY
    parse_date_time(c("27-09-13", "27 September 2013"), orders = c("dmy", "dBY"))
    # [1] "2013-09-27 UTC" "2013-09-27 UTC"
    # OK

    # dmy & dbY
    parse_date_time(c("27-09-13", "27 Sep 2013"), orders = c("dmy", "dbY"))
    # [1] "2013-09-27 UTC" "2013-09-27 UTC"
    # OK

    # dmy & dmY
    parse_date_time(c("27-09-13", "27-09-2013"), orders = c("dmy", "dmY"))
    # [1] "0013-09-27 UTC" "2013-09-27 UTC"
    # not OK

    # does order of the date components matter?
    parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Ymd", "ymd"))
    # [1] "2013-09-27 UTC" "0013-09-27 UTC"
    # no

What about the select_formats argument? I am sorry to say it, but I have a hard time understanding this section of the help file. A search for select_formats on SO returned 0 results. Nevertheless, this passage looked relevant: "By default the formats with most formatting tokens (%) are selected and %Y counts as 2.5 tokens (so that it can take precedence over %y%m)." So I (desperately) tried with some additional dmy dates:

    parse_date_time(c("27-09-2013", rep("27-09-13", 10)), orders = c("dmy", "dmY"))
    # not OK. Tried also 100 dmy dates.

    # does order in the vector matter?
    parse_date_time(c(rep("27-09-13", 10), "27-09-2013"), orders = c("dmy", "dmY"))
    # no

Then I checked how the guess_formats function (also in lubridate) handles dmy together with dmY:

    guess_formats(c("27-09-13", "27-09-2013"), c("dmy", "dmY"), print_matches = TRUE)
    #                   dmy        dmY
    # [1,] "27-09-13"   "%d-%m-%y" ""
    # [2,] "27-09-2013" "%d-%m-%Y" "%d-%m-%Y"
    # OK

From ?guess_formats: y also matches Y. From ?parse_date_time: y* Year without century (00–99 or 0–99). Also matches year with century (Y format). So I tried:

    guess_formats(c("27-09-13", "27-09-2013"), c("dmy"), print_matches = TRUE)
    #                   dmy
    # [1,] "27-09-13"   "%d-%m-%y"
    # [2,] "27-09-2013" "%d-%m-%Y"
    # OK

So guess_formats seems able to deal with dmy together with dmY. But how can I tell parse_date_time to do the same? Thanks in advance for any comments or help.

Update: I posted the question on the lubridate issues page and received a quick response from @vitoshka: "This is a bug."

+6

date r lubridate

Henrik 01 Oct '13 at 22:30

2 answers

This is actually intentional. I remember it now. The assumption is that if you have dates of the form 01-02-1845 and 01-02-03 in the same vector, then you probably mean them as written. It also avoids confusion between dates from different centuries: you cannot know whether 17-05-13 belongs to the 20th or the 21st century.

There may also have been a technical reason for this decision, but I don't remember it right now.
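
To make the century ambiguity concrete, here is a minimal sketch in base R (not lubridate's internal logic). It assumes the fixed pivot documented for strptime(), where two-digit years 00 to 68 are read as 20xx and 69 to 99 as 19xx, so any two-digit year is a guess about the century:

```r
# Base R only: as.Date() with %y applies the strptime() pivot for
# two-digit years (00-68 -> 20xx, 69-99 -> 19xx).
x <- c("17-05-13", "17-05-68", "17-05-69")
parsed <- as.Date(x, format = "%d-%m-%y")

# Extract the inferred four-digit years
years <- format(parsed, "%Y")
print(years)
```

The last two inputs differ by one year on paper but end up 99 years apart, which is the kind of silent guess a parser has to make for short years.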

The select_formats argument is the way to go:

    my_select <- function(trained){
      n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
      names(trained[ which.max(n_fmts) ])
    }

    parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
    ## [1] "2013-09-27 UTC" "2013-09-27 UTC"

select_formats should return the formats that will be applied sequentially to the input character vector. In the above example, you give priority to the %y format.
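
For readers who want to see what the scoring does, here is a small base R sketch of the arithmetic only. The vector fmts is a hypothetical stand-in for names(trained); the real object passed by lubridate may hold different formats, and the %Y weight of 1.5 is taken from the internal function quoted elsewhere on this page:

```r
# Hypothetical stand-in for names(trained): the two candidate formats
fmts <- c("%d-%m-%y", "%d-%m-%Y")

# Count the % tokens in each format (3 for both)
n_tokens <- nchar(gsub("[^%]", "", fmts))

# Default-style score: a bonus of 1.5 for %Y
default_score <- n_tokens + grepl("%Y", fmts) * 1.5

# Custom score as in my_select: a bonus of 1.5 for %y instead
custom_score <- n_tokens + grepl("%y", fmts) * 1.5

default_pick <- fmts[which.max(default_score)]  # "%d-%m-%Y"
custom_pick  <- fmts[which.max(custom_score)]   # "%d-%m-%y"
```

With the bonus moved to %y, the %y format wins, and since y also matches year with century (per ?parse_date_time), both kinds of input can be parsed with it.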

I am adding this example to the docs.

+1

VitoshKa Oct 2 '13 at 20:59

agstudy · Accepted Answer · 2013-10-02T01:23:56+0000

This looks like a bug. I am not sure, so you should contact the maintainer.

Building the package from source and changing one line in this internal function (I replace which.max with which.min):

    .select_formats <- function(trained){
      n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%Y", names(trained))*1.5
      names(trained[ which.min(n_fmts) ]) ## replace which.max by which.min
    }

seems to fix the problem. Frankly, I do not know why this works, but I guess it is a kind of ranking.

    parse_date_time(c("27-09-13", "27-09-2013"), orders = c("dmy", "dmY"))
    # [1] "2013-09-27 UTC" "2013-09-27 UTC"
    parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Ymd", "ymd"))
    # [1] "2013-09-27 UTC" "2013-09-13 UTC"
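
If rebuilding the package is not an option, one possible workaround outside lubridate is to split the vector by year width and parse each part with its own format. This is only a sketch in base R: parse_dmy_mixed is a hypothetical helper name, it assumes "-" separators and day-month-year order, and it returns Date rather than the POSIXct values that parse_date_time gives.

```r
# Hypothetical helper: parse a mix of dd-mm-yy and dd-mm-yyyy strings.
parse_dmy_mixed <- function(x) {
  # TRUE where the string ends in a four-digit year
  four_digit <- grepl("-[0-9]{4}$", x)

  # Start with all-NA Dates, then fill each group with its own format
  out <- rep(as.Date(NA), length(x))
  out[four_digit]  <- as.Date(x[four_digit],  format = "%d-%m-%Y")
  out[!four_digit] <- as.Date(x[!four_digit], format = "%d-%m-%y")
  out
}

res <- parse_dmy_mixed(c("27-09-13", "27-09-2013"))
print(res)
```

Two-digit years here follow base R's own century rule, so this does not remove the ambiguity; it only makes the choice explicit.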
