Running rounding

I am trying to implement column rounding such that the running sum of the rounded values matches the running sum of the original values within each group.

The sample data for the task consists of three columns:

  • numbers - the values that I need to round;
  • ids - define the order of the values; for a time series this could be a date;
  • group - defines the group within which the numbers must be rounded.

Here is an example of the data, already ordered by id within each group:

 numbers ids group
35.07209   1     1
27.50931   2     1
70.62019   3     1
99.55451   6     1
34.40472   8     1
17.58864  10     1
93.66178   4     3
83.21700   5     3
63.89058   7     3
88.96561   9     3

To generate sample data for testing, I use this code:

    # Make data sample.
    x.size <- 10^6
    x <- list("numbers" = runif(x.size) * 100,
              "ids" = 1:x.size,
              "group" = ifelse(runif(x.size) > 0.2, 1,
                               ifelse(runif(x.size) > 0.8, 2, 3)))
    x <- data.frame(x)
    x <- x[order(x$group), ]

I wrote a function that keeps the rounding state within a group, so that the overall sum of the rounded values stays correct:

    makeRunRound <- function() {
      # Data must be sorted by id.
      cumDiff <- 0
      savedId <- 0
      function(x, id) {
        # id here represents the group.
        if (id != savedId) {
          cumDiff <<- 0
          savedId <<- id
        }
        xInt <- floor(x)
        cumDiff <<- x - xInt + cumDiff
        if (cumDiff > 1) {
          xInt <- xInt + round(cumDiff)
          cumDiff <<- cumDiff - round(cumDiff)
        }
        return(xInt)
      }
    }
    runRound <- makeRunRound()
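To make the behaviour concrete, here is a minimal usage sketch of my own (not part of the benchmarks) on the first three rows of the small sample above:

    # Fresh closure so the saved rounding state starts from zero.
    runRound <- makeRunRound()
    small <- data.frame(numbers = c(35.07209, 27.50931, 70.62019), group = 1)
    small$rounded <- mapply(FUN = runRound, small$numbers, small$group)
    small$rounded       # 35 27 71 -- the accumulated ~1.2 is released on the third value
    sum(small$rounded)  # 133, which matches round(sum(small$numbers))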

This approach works, and I would be happy with it if it were not for the speed.

It takes 2-3 seconds to round a sample of 1 million rows.

This is too long for me, and there is another approach that works about six times faster. I took the code from josliber's answer:

    smartRound <- function(x) {
      y <- floor(x)
      indices <- tail(order(x - y), round(sum(x)) - sum(y))
      y[indices] <- y[indices] + 1
      y
    }
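As a quick sanity check of my own (not from that answer), smartRound preserves the grand total of the vector, which is what makes it attractive:

    y <- smartRound(x$numbers)
    sum(y) == round(sum(x$numbers))  # TRUE: the overall total is preserved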

Benchmarking on the sample data generated by the code above:

    # Code to benchmark speed.
    library(microbenchmark)
    library(data.table)        # the .dt variants assume x.dt, a data.table copy of x
    x.dt <- as.data.table(x)
    res <- microbenchmark(
      "run.df"   = x$mrounded <- mapply(FUN = runRound, x$numbers, x$group),
      "run.dt"   = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
      "smart.df" = x$smart.round <- smartRound(x$numbers),
      "smart.dt" = smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
      "silly"    = x$silly.round <- round(x$numbers),
      times = 50
    )
    print(res)
    boxplot(res)

gives the following results:

Test for different rounding methods

    Unit: milliseconds
         expr        min         lq       mean     median         uq        max neval
       run.df 3475.69545 3827.13649 3994.09184 3967.27759 4179.67702 4472.18679    50
       run.dt 2449.05820 2633.52337 2895.51040 2881.87608 3119.42219 3617.67113    50
     smart.df  488.70854  537.03179  576.57704  567.63077  611.81271  861.76436    50
     smart.dt  390.35646  414.96749  468.95317  457.85820  507.54395  631.17081    50
        silly   13.72486   15.82744   19.41796   17.19057   18.85385   88.06329    50

Thus, timings range from roughly 20 ms for plain rounding up to roughly 2.6 s for the method that keeps a running total of the rounded values within each group.

I have included a comparison of calculations based on data.frame and data.table to demonstrate that there is no significant difference, although data.table improves performance a bit.

I really like the simplicity and speed of smartRound, but it does not respect the order of the elements, so the result differs from what I need.
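To illustrate, a toy vector of my own (not from the original post): smartRound hands the +1 carries to the largest fractional parts wherever they sit, so the running total of the rounded values can drift away from the running total of the originals even though the grand total matches:

    v <- c(rep(0.4, 5), rep(0.6, 5))
    sv <- smartRound(v)
    sv                        # 0 0 0 0 0 1 1 1 1 1 -- all carries land on the 0.6s
    cumsum(sv)[5]             # 0, while cumsum(v)[5] is 2 -- the running total drifts
    sum(sv) == round(sum(v))  # TRUE: only the grand total is guaranteed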

Is there any way:

  • change smartRound to achieve the same results as runRound without losing performance?
  • or, modify runRound to improve performance?
  • or is there an even better solution altogether?

EDIT:

Answer

dww gives the fastest solution:

    diffRound <- function(x) {
      diff(c(0, round(cumsum(x))))
    }
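A minimal usage sketch of how this gets applied per group (my own snippet; x.dt is the data.table copy of x used in the benchmarks):

    library(data.table)
    x.dt <- as.data.table(x)                  # if not already created above
    x.dt[, rounded := diffRound(numbers), by = group]
    # Inside every group the running total of the rounded values stays within
    # 0.5 of the running total of the originals, because cumsum(rounded)
    # equals round(cumsum(numbers)).
    x.dt[, max(abs(cumsum(rounded) - cumsum(numbers))), by = group]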

I reduced the test to four options:

    res <- microbenchmark(
      "silly"     = x$silly.round <- round(x$numbers),
      "diff(dww)" = smart.round.dt <- x.dt[, .(rounded = diffRound(numbers)), by = .(group)],
      "smart.dt"  = smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
      "run.dt"    = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
      times = 50
    )

New Results:

Updated test

    Unit: milliseconds
          expr        min         lq       mean     median         uq        max neval
         silly   14.67823   16.64882   17.31416   16.83338   17.67497   22.48689    50
     diff(dww)   54.57762   70.11553   76.67135   71.37325   76.83717  139.18745    50
      smart.dt  392.83240  408.65768  456.46592  441.33212  492.67824  592.57723    50
        run.dt 2564.02724 2651.13994 2751.80516 2708.45317 2830.44553 3101.71005    50

Thanks to dww, I get a 6-fold performance gain without losing accuracy.

1 answer

I would do it this way, with simple base vectorized functions:

First calculate the running total of the source numbers and round that running total. Then find the sequence of rounded numbers that adds up to this rounded total, using diff() to see how much each rounded running total is larger than the previous one.

    cum.sum <- cumsum(x$numbers)
    cum.sum.rounded <- round(cum.sum)
    numbers.round <- diff(cum.sum.rounded)
    numbers.round <- c(cum.sum.rounded[1], numbers.round)

Make sure everything is the way you want:

    check.cs <- cumsum(numbers.round)
    all(abs(check.cs - cum.sum) <= 1)
    # TRUE
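If the same idea has to respect the groups from the question, one hedged way to do it in base R (my addition, not part of dww's answer) is ave(), which keeps the original row order within each group:

    # Apply the cumulative-rounding idea within each group; ave() preserves
    # the original row order inside every group (data must be sorted by id
    # within group, as in the question).
    x$rounded <- ave(x$numbers, x$group,
                     FUN = function(v) diff(c(0, round(cumsum(v)))))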
