A more efficient method for counting the open cases as of the creation of each case in a large data set

In a large data set (~1M cases), each case has a "created" and a "censored" dateTime. I want to count, for each case, the number of other cases that were open at the time it was created. Cases are open between their "created" and "censored" dateTimes.
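To make the target quantity concrete, here is a minimal sketch on a toy table (three hand-made cases, not the generator below):

    library(data.table)

    # Toy data: case 2 is created while case 1 is still open;
    # case 3 is created while cases 1 and 2 are still open.
    toy <- data.table(
        id       = 1:3,
        created  = as.POSIXct(c("2000-01-01", "2000-06-01", "2000-09-01"), tz = "UTC"),
        censored = as.POSIXct(c("2001-01-01", "2000-12-01", "2002-01-01"), tz = "UTC")
    )

    # open_cases_at_creation for case i = number of other cases j with
    # created[j] < created[i] and censored[j] > created[i]
    toy[, open_cases_at_creation := vapply(seq_len(.N),
            function(i) sum(created < created[i] & censored > created[i]),
            integer(1))]
    toy$open_cases_at_creation
    # expected: 0 1 2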

Several solutions work well on small data sets (<100,000 cases), but the computation time grows roughly quadratically; my estimate is that it scales as about 3n^2. For n = 100,000 cases, the computation takes more than 20 minutes on my server (6 cores at 4 GHz, 64 GB of RAM). Even with multi-core libraries, at best I could reduce the time by a factor of 8 or 10, which is not enough to handle ~1M cases.
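For reference, this is roughly how I check the scaling (a sketch; it assumes generate_cases_table() and solution_1_function() from the code below have already been defined):

    # Time the dplyr::summarize() solution at increasing case counts.
    # Elapsed time roughly quadrupling when n doubles indicates ~O(n^2) growth.
    for (n in c(5000, 10000, 20000, 40000)) {
        CASE_COUNT <<- n
        cases_table <- generate_cases_table()
        cat(n, "cases:",
            system.time(solution_1_function(cases_table))[["elapsed"]], "seconds\n")
    }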

I am looking for a more efficient method for this calculation. Below I provide a function that makes it easy to create a large number of "created" and "censored" dateTime pairs, along with the two solutions developed so far, using dplyr and data.table. Timings are reported to the user for simplicity. You can simply change the CASE_COUNT variable at the top, re-run the script to view the timings again, and easily compare the timings of any other solutions you might offer.

I am updating the original post with additional solutions to give credit to their authors. Thanks in advance for your help!

    # Load libraries used in this example
    library(dplyr);
    library(data.table);
    # Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
    library(IRanges);

    # Set seed for reproducibility
    set.seed(123)

    # Set number of cases & date range variables
    CASE_COUNT  <<- 1000;
    RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
                              format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
    RANGE_END   <- as.POSIXct("2012-01-01 00:00:00",
                              format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

    # Select which solutions you want to run in this test
    RUN_SOLUTION_1 <- TRUE;    # dplyr::summarize() + comparisons
    RUN_SOLUTION_2 <- TRUE;    # data.table::foverlaps()
    RUN_SOLUTION_3 <- TRUE;    # data.table aggregation + comparisons
    RUN_SOLUTION_4 <- TRUE;    # IRanges::IRanges + countOverlaps()
    RUN_SOLUTION_5 <- TRUE;    # data.table::frank()

    # Function to generate random creation & censor dateTime pairs
    # The censor time always has to be after the creation time
    # Credit to @DirkEddelbuettel for this smart function
    # (https://stackoverflow.com/users/143305/dirk-eddelbuettel)
    generate_cases_table <- function(n = CASE_COUNT, start_val = RANGE_START, end_val = RANGE_END) {
        # Measure duration between start_val & end_val
        duration <- as.numeric(difftime(end_val, start_val, unit="secs"));

        # Select random values in duration to create start_offset
        start_offset <- runif(n, 0, duration);

        # Calculate the creation time list
        created_list <- start_offset + start_val;

        # Calculate acceptable time range for censored values,
        # since they must always be after their respective creation value
        censored_range <- as.numeric(difftime(RANGE_END, created_list, unit="secs"));

        # Select random values in duration to create end_offset
        creation_to_censored_times <- runif(n, 0, censored_range);
        censored_list <- created_list + creation_to_censored_times;

        # Create and return a data.table with creation & censor values
        # calculated from start or end with random offsets
        return_table <- data.table(id       = 1:n,
                                   created  = created_list,
                                   censored = censored_list);
        return(return_table);
    }

    # Create the data table with the desired number of cases specified by CASE_COUNT above
    cases_table <- generate_cases_table();

    solution_1_function <- function(cases_table) {
        # SOLUTION 1: Using dplyr::summarize:

        # Group by id to set parameters for summarize() function
        cases_table_grouped <- group_by(cases_table, id);

        # Count the instances where other cases were created before
        # and censored after each case using vectorized sum() within summarize()
        cases_table_summary <- summarize(cases_table_grouped,
            open_cases_at_creation = sum((cases_table$created < created &
                                          cases_table$censored > created)));
        solution_1_table <<- as.data.table(cases_table_summary, key="id");
    } # End solution_1_function

    solution_2_function <- function(cases_table) {
        # SOLUTION 2: Using data.table::foverlaps:
        # Adapted from solution provided by @DavidArenburg
        # (https://stackoverflow.com/users/3001626/david-arenburg)

        # The foverlaps() solution tends to crash R with large case counts.
        # I suspect it has to do with memory assignment of the very large objects.
        # It maxes RAM on my system (64GB) before crashing, possibly attempting
        # to write beyond its assigned memory limits.
        # I'll submit a reproducible bug to the data.table team since
        # foverlaps() is pretty new and known to be occasionally unstable.
        if (CASE_COUNT > 50000) {
            stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
        }

        setDT(cases_table)[, created_dupe := created];
        setkey(cases_table, created, censored);

        foverlaps_table <- foverlaps(cases_table[, c("id", "created", "created_dupe"), with=FALSE],
                                     cases_table[, c("id", "created", "censored"), with=FALSE],
                                     by.x=c("created", "created_dupe"))[order(i.id), .N - 1, by=i.id];
        foverlaps_table <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);
        solution_2_table <<- as.data.table(foverlaps_table, key="id");
    } # End solution_2_function

    solution_3_function <- function(cases_table) {
        # SOLUTION 3: Using data.table aggregation instead of dplyr::summarize
        # Idea suggested by @jangorecki
        # (https://stackoverflow.com/users/2490497/jangorecki)

        # Count the instances where other cases were created before
        # and censored after each case using vectorized sum() with data.table aggregation
        cases_table_aggregated <- cases_table[order(id),
            sum((cases_table$created < created & cases_table$censored > created)), by=id];

        solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated,
                                                         open_cases_at_creation=V1), key="id");
    } # End solution_3_function

    solution_4_function <- function(cases_table) {
        # SOLUTION 4: Using the IRanges package
        # Adapted from solution suggested by @alexis_laz
        # (https://stackoverflow.com/users/2414948/alexis-laz)

        # The IRanges package generates ranges efficiently, intended for genome sequencing,
        # but it works perfectly well on this data, since POSIXct values are numeric-representable
        solution_4_table <<- data.table(id = cases_table$id,
            open_cases_at_creation = countOverlaps(IRanges(cases_table$created, cases_table$created),
                                                   IRanges(cases_table$created, cases_table$censored)) - 1,
            key="id");
    } # End solution_4_function

    solution_5_function <- function(cases_table) {
        # SOLUTION 5: Using data.table::frank()
        # Adapted from solution suggested by @danas.zuokas
        # (https://stackoverflow.com/users/1249481/danas-zuokas)
        n <- CASE_COUNT;

        # For every case, compute the number of other cases
        # with `created` less than `created` of that case
        r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
                                ties.method = 'first')[1:n];

        # For every case, compute the number of other cases
        # with `censored` less than `created`
        r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
                                ties.method = 'first')[1:n];

        solution_5_table <<- data.table(id = cases_table$id,
                                        open_cases_at_creation = r1 - r2, key="id");
    } # End solution_5_function

    # Execute user-specified functions
    if (RUN_SOLUTION_1) solution_1_timing <- system.time(solution_1_function(cases_table));
    if (RUN_SOLUTION_2) {
        solution_2_timing <- try(system.time(solution_2_function(cases_table)));
        cases_table <- select(cases_table, -created_dupe);
    }
    if (RUN_SOLUTION_3) solution_3_timing <- system.time(solution_3_function(cases_table));
    if (RUN_SOLUTION_4) solution_4_timing <- system.time(solution_4_function(cases_table));
    if (RUN_SOLUTION_5) solution_5_timing <- system.time(solution_5_function(cases_table));

    # Check generated tables for comparison
    if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing) != "try-error") {
        same_check1_2 <- all(solution_1_table$open_cases_at_creation ==
                             solution_2_table$open_cases_at_creation);
    } else {same_check1_2 <- TRUE;}
    if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
        same_check1_3 <- all(solution_1_table$open_cases_at_creation ==
                             solution_3_table$open_cases_at_creation);
    } else {same_check1_3 <- TRUE;}
    if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
        same_check1_4 <- all(solution_1_table$open_cases_at_creation ==
                             solution_4_table$open_cases_at_creation);
    } else {same_check1_4 <- TRUE;}
    if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
        same_check1_5 <- all(solution_1_table$open_cases_at_creation ==
                             solution_5_table$open_cases_at_creation);
    } else {same_check1_5 <- TRUE;}
    if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing) != "try-error") {
        same_check2_3 <- all(solution_2_table$open_cases_at_creation ==
                             solution_3_table$open_cases_at_creation);
    } else {same_check2_3 <- TRUE;}
    if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing) != "try-error") {
        same_check2_4 <- all(solution_2_table$open_cases_at_creation ==
                             solution_4_table$open_cases_at_creation);
    } else {same_check2_4 <- TRUE;}
    if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing) != "try-error") {
        same_check2_5 <- all(solution_2_table$open_cases_at_creation ==
                             solution_5_table$open_cases_at_creation);
    } else {same_check2_5 <- TRUE;}
    if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
        same_check3_4 <- all(solution_3_table$open_cases_at_creation ==
                             solution_4_table$open_cases_at_creation);
    } else {same_check3_4 <- TRUE;}
    if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
        same_check3_5 <- all(solution_3_table$open_cases_at_creation ==
                             solution_5_table$open_cases_at_creation);
    } else {same_check3_5 <- TRUE;}
    if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
        same_check4_5 <- all(solution_4_table$open_cases_at_creation ==
                             solution_5_table$open_cases_at_creation);
    } else {same_check4_5 <- TRUE;}

    same_check <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
                      same_check2_3, same_check2_4, same_check2_5,
                      same_check3_4, same_check3_5, same_check4_5);

    # Report summary of results to user
    cat("This execution was for", CASE_COUNT, "cases.\n",
        "It is", same_check, "that all solutions match.\n");
    if (RUN_SOLUTION_1) cat("The dplyr::summarize() solution took",
                            solution_1_timing[3], "seconds.\n");
    if (RUN_SOLUTION_2 && class(solution_2_timing) != "try-error")
        cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
    if (RUN_SOLUTION_3) cat("The data.table aggregation solution took",
                            solution_3_timing[3], "seconds.\n");
    if (RUN_SOLUTION_4) cat("The IRanges solution took",
                            solution_4_timing[3], "seconds.\n");
    if (RUN_SOLUTION_5) cat("The data.table::frank() solution took",
                            solution_5_timing[3], "seconds.\n\n");

The data.table::foverlaps() solution is faster for smaller case counts (fewer than about 5,000; the crossover depends on the randomness of the data as well as on n, since it uses a binary search for optimization). The dplyr::summarize() solution is faster for larger case counts (more than about 5,000). Beyond 100,000 cases, neither solution is viable, as both are too slow.

EDIT: Added a third solution based on the idea proposed by @jangorecki, which uses data.table aggregation instead of dplyr::summarize() and is otherwise similar to the dplyr solution. Up to roughly 50,000 cases it is the fastest solution; beyond that, the dplyr::summarize() solution is slightly faster, but not by much. Unfortunately, for 1M cases it is still impractical.

EDIT2: Added a fourth solution, adapted from the solution proposed by @alexis_laz, which uses the IRanges package and its countOverlaps function. It is significantly faster than the other three solutions; with 50,000 cases it was almost 400% faster than solutions 1 and 3.

EDIT3: Modified the case generation function so that it handles the censored dateTimes properly (censored always after created). Thanks to @jangorecki for pointing out the limitations of the previous version.

EDIT4: Rewrote the code so that the user can choose which solutions to execute, and switched to system.time() with garbage collection before each execution for more accurate timing (following an excellent observation by @jangorecki). Also added some condition checks for failure cases.
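For anyone replicating the timing approach: system.time() already runs gc() before evaluating the expression, because gcFirst defaults to TRUE, so each solution is timed from a comparable memory state. A small sketch using the names from the code above:

    # gcFirst = TRUE is the default; written out here only for clarity
    solution_3_timing <- system.time(solution_3_function(cases_table), gcFirst = TRUE)
    solution_3_timing[["elapsed"]]   # the wall-clock seconds reported in the summary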

EDIT5: Added a fifth solution, adapted from the one proposed by @danas.zuokas, using rank(). My experiments suggest it is always at least an order of magnitude slower than the other solutions: with 10,000 cases it needs 44 seconds, versus 3.5 seconds for the other solutions and 0.36 seconds for IRanges.

FINAL EDIT: I made minor changes to solution 5 proposed by @danas.zuokas and incorporated the related observation by @Khashaa about types. I apply as.numeric to the dateTimes in the generation function, which dramatically speeds up rank because it then operates on integers or doubles instead of dateTime objects (it speeds up the other functions too, but not as dramatically). With some testing, setting ties.method='first' produces results consistent with the intent. data.table::frank is faster than base::rank and IRanges::rank. bit64::rank is the fastest, but it seems to handle ties differently from data.table::frank, and I cannot get it to handle them the way I need. Once bit64 is loaded, it masks a large number of types and functions, changing the results of data.table::frank along the way. The specific reasons why are beyond the scope of this question.

RANK NOTE: It turns out that data.table::frank handles POSIXct dateTimes efficiently, while neither base::rank nor IRanges::rank does. So the as.numeric (or as.integer) conversion is not even required with data.table::frank, there is no loss of precision from a conversion, and there are fewer ties.method discrepancies. Thanks to everyone who contributed! I learned a lot! Very grateful! :) Credit will be given in my source code.
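A quick way to sanity-check that claim (a minimal sketch, not part of the benchmark):

    library(data.table)

    # frank() ranks POSIXct directly; converting to numeric first gives the same ranks,
    # since POSIXct is stored as a double underneath
    x <- as.POSIXct("2000-01-01", tz = "UTC") + runif(5, 0, 1e6)
    identical(frank(x, ties.method = "first"),
              frank(as.numeric(x), ties.method = "first"))
    # expected: TRUE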

ENDNOTE: This question is a refined version of "A more efficient method for counting open cases as of the creation of each case", with a simpler and clearer code example. I split it out here so as not to overload the original post with too many edits, and to simplify the generation of a large number of dateTime pairs in the sample code. That way, answering requires less work. Thanks again!

Tags: performance, vectorization, r, data.table, dplyr
2 answers

The answer has been updated following a comment from the question's author.

I would suggest a solution using ranks. Tables are created as in this question, or using this question's dateTime pair generation function; both should work.

    n <- cases_table[, .N]

    # For every case, compute the number of other cases
    # with `created` less than that case's `created`
    r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
                            ties.method = 'first')[1:n]

    # For every case, compute the number of other cases
    # with `censored` less than `created`
    r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
                            ties.method = 'first')[1:n]

Taking the difference r1 - r2 gives the result (the -1 is not required when ties.method = 'first' is used). In terms of efficiency, it only requires ranking vectors whose length is the number of rows in cases_table. data.table::frank handles POSIXct dateTime objects as fast as numeric ones (unlike base::rank), so no type conversion is required.
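To see why the difference works: with ties.method = 'first' and no exact ties, r1[i] - r2[i] equals the number of cases with created earlier than created[i] minus the number of cases with censored earlier than created[i], which is exactly the open-case count. A small brute-force check (a sketch reusing the question's generate_cases_table()):

    CASE_COUNT <<- 2000
    ct <- generate_cases_table()
    n  <- ct[, .N]

    r1 <- data.table::frank(c(ct[, created], ct[, created]),  ties.method = 'first')[1:n]
    r2 <- data.table::frank(c(ct[, created], ct[, censored]), ties.method = 'first')[1:n]
    rank_answer <- r1 - r2

    # Brute-force definition of open cases at each case's creation
    brute_answer <- vapply(seq_len(n),
        function(i) sum(ct$created < ct$created[i] & ct$censored > ct$created[i]),
        integer(1))

    all(rank_answer == brute_answer)
    # expected: TRUE (exact ties are essentially impossible with runif-generated times)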


This probably does not answer your question exactly, since the reproducible example is not consistent with the condition cases_table$censored > created; see the min and max below. A smaller example would help to spot such problems. You should also use set.seed in your example.

    set.seed(123)
    library(data.table)

    CASE_COUNT  <- 1000;
    RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
                              format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
    RANGE_END   <- as.POSIXct("2012-01-01 00:00:00",
                              format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

    generate_cases_table <- function(n = CASE_COUNT, start = RANGE_START, end = RANGE_END) {
        half_duration <- as.numeric(difftime(end, start, unit="sec")) / 2;
        start_offset  <- runif(n, 0, half_duration);
        end_offset    <- runif(n, 0, half_duration);
        data.table(id       = 1:n,
                   created  = start + start_offset,
                   censored = end - end_offset)
    }

    cases_table = generate_cases_table()

    cases_table[, .(min_censored = min(censored), max_created = max(created))]
    #           min_censored         max_created
    # 1: 2006-01-01 13:02:12 2005-12-30 04:40:49

    setorder(cases_table, created)[, created_so_far := .I - 1L]
    cases_table[, censored_after := cases_table[cases_table,
                                                on = c("created" = "censored"),
                                                roll = Inf, which = TRUE]]

You may need to adjust the rolling join, but I could not verify it because of the sample-data problem mentioned above.
The which argument simply extracts the row number from the rolling join; for data sorted by created, that row number also counts the observations up to the joined point. Because of the data problem mentioned above, the result here is always 1000, since every created is less than every censored.
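For readers unfamiliar with the mechanism, a minimal illustration on toy data (not the question's table) of how roll = Inf together with which = TRUE turns a rolling join into a positional count:

    library(data.table)

    events  <- data.table(time = as.POSIXct(c("2000-01-01", "2000-03-01", "2000-06-01"), tz = "UTC"))
    queries <- data.table(when = as.POSIXct(c("2000-02-15", "2000-07-01"), tz = "UTC"))
    setkey(events, time)   # sorted by time, so a row number is also a running count

    # For each query time, roll back to the most recent event at or before it
    # and return that event's row number
    events[queries, on = .(time = when), roll = Inf, which = TRUE]
    # expected: 1 3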
For a detailed description of data.table rolling joins, see this post: http://gormanalysis.com/r-data-table-rolling-joins/
Once you manage to apply this solution, please share the timing difference in the comments.

