A more efficient method for counting the cases open at each case's creation time

I am trying to find a more efficient way to count, for each creation time of interest, how many cases were open at that moment. A case is "open" between its created date/time stamp and its censored date/time stamp. You can copy-paste the code below to run a small working example:

    # Load the libraries used below: dplyr for group_by() + summarize(),
    # data.table for the tables and the foverlaps() comparison
    library(dplyr)
    library(data.table)

    # Helper to build UTC date/time stamps for the example
    ts_utc <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

    # Create a bunch of date/time stamps for our example
    two_thousand        <- ts_utc("2000-01-01 00:00:00")
    two_thousand_one    <- ts_utc("2001-01-01 00:00:00")
    two_thousand_two    <- ts_utc("2002-01-01 00:00:00")
    two_thousand_three  <- ts_utc("2003-01-01 00:00:00")
    two_thousand_four   <- ts_utc("2004-01-01 00:00:00")
    two_thousand_five   <- ts_utc("2005-01-01 00:00:00")
    two_thousand_six    <- ts_utc("2006-01-01 00:00:00")
    two_thousand_seven  <- ts_utc("2007-01-01 00:00:00")
    two_thousand_eight  <- ts_utc("2008-01-01 00:00:00")
    two_thousand_nine   <- ts_utc("2009-01-01 00:00:00")
    two_thousand_ten    <- ts_utc("2010-01-01 00:00:00")
    two_thousand_eleven <- ts_utc("2011-01-01 00:00:00")

    mid_two_thousand        <- ts_utc("2000-06-01 00:00:00")
    mid_two_thousand_one    <- ts_utc("2001-06-01 00:00:00")
    mid_two_thousand_two    <- ts_utc("2002-06-01 00:00:00")
    mid_two_thousand_three  <- ts_utc("2003-06-01 00:00:00")
    mid_two_thousand_four   <- ts_utc("2004-06-01 00:00:00")
    mid_two_thousand_five   <- ts_utc("2005-06-01 00:00:00")
    mid_two_thousand_six    <- ts_utc("2006-06-01 00:00:00")
    mid_two_thousand_seven  <- ts_utc("2007-06-01 00:00:00")
    mid_two_thousand_eight  <- ts_utc("2008-06-01 00:00:00")
    mid_two_thousand_nine   <- ts_utc("2009-06-01 00:00:00")
    mid_two_thousand_ten    <- ts_utc("2010-06-01 00:00:00")
    mid_two_thousand_eleven <- ts_utc("2011-06-01 00:00:00")

    # Pairs of created & censored date/time stamps for the cases, indicating
    # the range during which each case is "open"
    comparison_table <- data.table(
      id = 1:10,
      created  = c(two_thousand, two_thousand_two, two_thousand_four,
                   two_thousand_six, two_thousand_eight, two_thousand_ten,
                   two_thousand, two_thousand_six, two_thousand_three,
                   two_thousand_one),
      censored = c(two_thousand_one, two_thousand_three, two_thousand_five,
                   two_thousand_seven, two_thousand_nine, two_thousand_eleven,
                   two_thousand_five, two_thousand_ten, two_thousand_eight,
                   two_thousand_four)
    )

    # The creation date/time stamps at which we want to count the open cases
    check_table <- data.table(
      id = 1:12,
      creation = c(mid_two_thousand, mid_two_thousand_one, mid_two_thousand_two,
                   mid_two_thousand_three, mid_two_thousand_four,
                   mid_two_thousand_five, mid_two_thousand_six,
                   mid_two_thousand_seven, mid_two_thousand_eight,
                   mid_two_thousand_nine, mid_two_thousand_ten,
                   mid_two_thousand_eleven)
    )

    # I use dplyr because group_by() + summarize() make this operation simple.
    # Group by id so that summarize() is evaluated once per check row
    check_table_grouped <- group_by(check_table, id)

    # For each id, count the cases whose created stamp is earlier and whose
    # censored stamp is later than that row's creation date/time
    # EDIT: also added timing to compare with the method below
    system.time(
      check_table_summary <- summarize(
        check_table_grouped,
        other_open_values_at_creation_count =
          sum(comparison_table$created < creation &
              comparison_table$censored > creation)
      )
    )

    # Result is as desired
    check_table_summary

    # EDIT: added @David-Arenburg's foverlaps() solution with timing
    setDT(check_table)[, creation2 := creation]
    setkey(comparison_table, created, censored)
    system.time(
      foverlaps_table <- foverlaps(check_table, comparison_table,
                                   by.x = c("creation", "creation2")
                                  )[, sum(!is.na(id)), by = i.id]
    )

    # Same results as above
    foverlaps_table

This approach works well for small datasets like the one in this example. However, even though the operations are vectorized, the calculation time grows quadratically: each of the nrow check rows performs roughly 3 Γ— nrow comparisons against the comparison table. With nrow = 10,000 the run takes about 14 s; with nrow = 100,000 it takes about 20 minutes. My actual data has ~1,000,000 rows.
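For scale, the quantity being computed (how many created/censored intervals cover each query time) can also be obtained in O((n + m) log n) from two sorted vectors. A minimal base-R sketch; the `open_count()` helper is my illustration, not part of the original code, and it assumes each case's censored stamp is after its created stamp:

```r
# Count, for each query time t, the intervals with created < t and censored > t.
# Because censored > created for every case, this equals
#   #(created < t) - #(censored <= t),
# and both counts come from findInterval() on sorted vectors.
open_count <- function(t, created, censored) {
  findInterval(t, sort(created), left.open = TRUE) -  # #(created < t)
    findInterval(t, sort(censored))                   # #(censored <= t)
}

created  <- c(1, 2, 5)
censored <- c(4, 6, 8)
open_count(c(3, 4), created, censored)
# returns 2 1  (at t = 3 intervals 1 and 2 are open; at t = 4 only interval 2)
```

Since `open_count()` is vectorized over `t`, one call covers the whole check table instead of one scan per row.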

Is there a more efficient way to do this calculation? I am currently looking at multi-core options, but even those would only reduce the runtime linearly. Your help is appreciated. Thanks!

EDIT: Added @David-Arenburg's data.table::foverlaps solution, which also works and runs faster at 1,000 rows. However, it is slower than the summarize solution at larger row counts: at 10,000 rows it took twice as long, and at 50,000 rows I gave up after waiting ten times as long. Interestingly, the foverlaps solution never seems to trigger automatic garbage collection, so it sits constantly at maximum RAM (64 GB on my system), while the summarize solution periodically triggers garbage collection and never exceeds ~40 GB. I'm not sure whether this explains the speed difference.

FINAL EDIT: I rewrote the question so that it is much easier for respondents to create large tables of created/censored dateTimes. I also simplified the problem and explained it more clearly, making explicit that the lookup table is very large (which violates the assumptions of data.table::foverlaps). I even built in a timing comparison to make testing with big inputs easy! Read more here: An efficient method for counting open cases at the submission time of each case in a large dataset

Thanks again for your help! :)

Tags: performance, vectorization, r, data.table, dplyr
1 answer

Another foverlaps solution. Assuming comparison_table is not too large:

    library(data.table)
    setkey(comparison_table, created, censored)

    # All distinct times at which a case is created or censored
    times <- sort(unique(c(comparison_table$created, comparison_table$censored)))

    # Degenerate one-point intervals placed one second after each event time
    dt <- data.table(creation = times + 1)[, creation2 := creation]
    setkey(dt, creation, creation2)

    # For each point, count the created/censored intervals that overlap it
    x <- foverlaps(comparison_table, dt,
                   by.x = c("created", "censored"))[, .N, creation]$N

    # Look up the count for the event time at or before each check time
    check_table$newcol <- x[findInterval(check_table$creation, times)]
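The last line relies on `findInterval()`, which returns, for each query value, the index of the latest break point at or before it. A small base-R illustration (my own, not part of the answer):

```r
event_times <- c(10, 20, 30)
findInterval(c(5, 15, 25, 35), event_times)
# returns 0 1 2 3
```

One caveat worth noting: a check time earlier than the first event yields index 0, and `x[0]` is a zero-length vector in R, so the final assignment would misalign or fail for such rows. In the example data the earliest check time (mid-2000) comes after the earliest created stamp, so every index is at least 1.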
