I am trying to find a more efficient way to count, for each case, the number of cases that were open at the time it was created. A case is "open" between its creation date/time stamp and its censoring date/time stamp. You can copy-paste the code below to run a simple functional example:
# Create a bunch of date/time stamps for our example.
# Helper to avoid repeating the same as.POSIXct() call.
make_ts <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

two_thousand            <- make_ts("2000-01-01 00:00:00")
two_thousand_one        <- make_ts("2001-01-01 00:00:00")
two_thousand_two        <- make_ts("2002-01-01 00:00:00")
two_thousand_three      <- make_ts("2003-01-01 00:00:00")
two_thousand_four       <- make_ts("2004-01-01 00:00:00")
two_thousand_five       <- make_ts("2005-01-01 00:00:00")
two_thousand_six        <- make_ts("2006-01-01 00:00:00")
two_thousand_seven      <- make_ts("2007-01-01 00:00:00")
two_thousand_eight      <- make_ts("2008-01-01 00:00:00")
two_thousand_nine       <- make_ts("2009-01-01 00:00:00")
two_thousand_ten        <- make_ts("2010-01-01 00:00:00")
two_thousand_eleven     <- make_ts("2011-01-01 00:00:00")

mid_two_thousand        <- make_ts("2000-06-01 00:00:00")
mid_two_thousand_one    <- make_ts("2001-06-01 00:00:00")
mid_two_thousand_two    <- make_ts("2002-06-01 00:00:00")
mid_two_thousand_three  <- make_ts("2003-06-01 00:00:00")
mid_two_thousand_four   <- make_ts("2004-06-01 00:00:00")
mid_two_thousand_five   <- make_ts("2005-06-01 00:00:00")
mid_two_thousand_six    <- make_ts("2006-06-01 00:00:00")
mid_two_thousand_seven  <- make_ts("2007-06-01 00:00:00")
mid_two_thousand_eight  <- make_ts("2008-06-01 00:00:00")
mid_two_thousand_nine   <- make_ts("2009-06-01 00:00:00")
mid_two_thousand_ten    <- make_ts("2010-06-01 00:00:00")
mid_two_thousand_eleven <- make_ts("2011-06-01 00:00:00")
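The counting code from my original example is not reproduced above, so here is a minimal sketch of the brute-force comparison approach described next. The case_table, its column names, and the vapply() loop are my illustrative reconstruction, not the original code:

# Illustrative case table (names are assumptions, not the original code):
# each case is created at one timestamp and censored (closed) at a later one.
case_table <- data.frame(
  id       = 1:5,
  created  = c(two_thousand, two_thousand_one, mid_two_thousand_one,
               two_thousand_two, mid_two_thousand_three),
  censored = c(mid_two_thousand_one, two_thousand_three, mid_two_thousand_two,
               mid_two_thousand_four, two_thousand_five)
)

# For each case i, count the cases j that were open at i's creation time:
# created_j <= created_i <= censored_j. Every row is compared against every
# other row, so the work grows as O(nrow^2).
case_table$open_at_creation <- vapply(
  seq_len(nrow(case_table)),
  function(i) sum(case_table$created <= case_table$created[i] &
                  case_table$censored >= case_table$created[i]),
  integer(1)
)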
This approach is fine for small datasets like the one in this example. However, despite the vectorized operations, the calculation time grows quadratically rather than linearly: each of the nrow cases requires about 3 * nrow comparisons, i.e. roughly O(nrow^2) operations in total. With nrow = 10,000 the run takes about 14 s; with nrow = 100,000 it takes ≈ 20 minutes. My actual dataset has ~1,000,000 rows.
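For reference, a rough way to reproduce the scaling measurements above on synthetic data. The generation scheme (creation times spread over a decade, censoring up to two years later) is an illustrative assumption, not my original benchmark:

set.seed(42)
n <- 10000

# Synthetic cases; this generation scheme is an assumption for illustration.
created  <- two_thousand + runif(n, 0, 10 * 365 * 24 * 3600)
censored <- created + runif(n, 0, 2 * 365 * 24 * 3600)

# Time the O(n^2) brute-force count of cases open at each creation time.
system.time(
  open_counts <- vapply(
    seq_len(n),
    function(i) sum(created <= created[i] & censored >= created[i]),
    integer(1)
  )
)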
Is there a more efficient way to do this calculation? I am currently looking at multi-core options, but even those would only reduce the runtime linearly. Your help is appreciated. Thanks!
EDIT: Added @David Arenburg's data.table::foverlaps solution, which also works and runs faster at 1,000 rows. However, it is slower than the summarize solution as the row count grows: at 10,000 rows it took twice as long, and at 50,000 rows I gave up after waiting 10 times longer. Interestingly, the foverlaps solution never seems to trigger automatic garbage collection, so it constantly sits at maximum RAM (64 GB on my system), while the summarize solution periodically triggers garbage collection and never exceeds ~40 GB of RAM. I'm not sure whether that explains the speed difference.
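For context, a minimal sketch of how foverlaps can be applied to this problem, as I understand that answer. The column names and the zero-width query intervals are my illustrative reconstruction, not the exact code from the answer:

library(data.table)

# Reusing the synthetic `created` and `censored` vectors from the timing
# sketch above; all names here are illustrative.
dt <- data.table(id = seq_along(created), start = created, end = censored)

# Zero-width query intervals at each case's creation time.
query <- data.table(start = dt$start, end = dt$start)

# foverlaps requires the lookup table to be keyed on its interval columns.
setkey(dt, start, end)
hits <- foverlaps(query, dt, type = "any", which = TRUE)

# Every query matches at least its own case, so each xid appears at least
# once; counting matches per query row gives the open-case counts.
open_counts_fo <- hits[, .N, keyby = xid]$N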
FINAL EDIT: I rewrote the question so that it is much easier for respondents to generate large tables of created/censored dateTimes. I also simplified the problem statement and explained it more clearly, making it explicit that the lookup table is very large (which violates the assumptions of data.table::foverlaps). I even built in a timing comparison to make testing against big datasets easy! Read more here: An efficient method for counting open cases at the time each case was submitted in a large data set
Thanks again for your help! :)