Combining dplyr :: mutate with lubridate :: ymd_hms in R randomly calls segfault

I searched as best as possible for something related to this, but found nothing on SO or dplyr github; perhaps a new problem as the code below worked fine until today?

The problem is conceptually simple: sometimes calling my_data %>% mutate(x = ymd_hms(x)) , but not always (i.e. randomly) causes R to crash with segfault captured. I reduced the problem to its simplest form (also here: https://gist.github.com/john-sandall/05c3abb24fc738ddc2ad ):

 require(lubridate) require(dplyr) set.seed(42) make_some_random_datetimes = function(n) ymd("2015-01-01") + seconds(runif(n, min=0, max=60*60*24*365)) d = data.frame( col1 = make_some_random_datetimes(5000), col2 = make_some_random_datetimes(5000) ) do_it = function() { d %>% mutate( col1 = ymd_hms(col1), col2 = ymd_hms(col2) # for some reason, it only crashes when evaluating 2+ cols, if we removed this line it'd be fine ) return(TRUE) } do_it() # doesn't crash every time...it fails every nth time where n is randomly distributed with mean of roughly 7.7 do_it_lots_of_times = function(n) for (i in 1:n) do_it() do_it_lots_of_times(50) # almost guaranteed to fail on my machine 

So, at some point, running do_it() above calls segfault, the output of R in the terminal is

 *** caught segfault *** address 0x0, cause 'unknown' 

This morning I upgraded to version 3.2.1, but rolling back to 3.2.0 and reinstalling the libraries does not help. Then I tried uninstalling / reinstalling R (using brew install r with a completely updated / updated homebrew), and then again installing all the necessary packages above. Here is the output of sessionInfo ():

 R version 3.2.1 (2015-06-18) Platform: x86_64-apple-darwin14.3.0 (64-bit) Running under: OS X 10.10.3 (Yosemite) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_0.4.2 lubridate_1.3.3 loaded via a namespace (and not attached): [1] lazyeval_0.1.10 R6_2.0.1 assertthat_0.1 magrittr_1.5 plyr_1.8.3 parallel_3.2.1 [7] DBI_0.3.1 tools_3.2.1 memoise_0.2.1 Rcpp_0.11.6 stringi_0.5-2 digest_0.6.8 [13] stringr_1.0.0 

Being a statistician and ending with ideas, I decided to look at the distribution of the failure frequency, see if this can help shed light. If the start of do_it() above fails for the nth time, and I wrote down where n falls for 50 failures (for example, the 3rd attempt, then the 7th attempt), I get the following sequence:

 3, 7, 9, 20, 9, 9, 9, 7, 4, 23, 6, 3, 3, 3, 7, 7, 3, 9, 6, 6, 7, 10, 13, 7, 3, 7, 4, 7, 9, 6, 7, 7, 6, 6, 7, 7, 7, 9, 6, 12, 7, 7, 5, 9, 18, 6, 7, 9, 9, 7 

What gives me this distribution:

I don’t know how important or helps at all, although I noticed that increasing the number of rows in dataframe d from 5000 to 10000 seems to increase the average value of n from ~ 8 to ~ 20.

Any help on this would be greatly appreciated!

+7
segmentation-fault r lubridate dplyr
source share
1 answer

90% sure this is a bug in the latest version of dplyr (0.4.2), see here: https://github.com/hadley/dplyr/issues/1231

Downgrading my dplyr version to 0.4.1 as follows fixes the problem:

 packageurl = "http://cran.r-project.org/src/contrib/Archive/dplyr/dplyr_0.4.1.tar.gz" install.packages(packageurl, repos=NULL, type="source", dependencies = TRUE) 
+6
source share

All Articles