I ran into some similar problem in this question. I put your solution in a function for better comparison:
goreF <- function(x,y,proportion){ temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL], a, c) temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)] chosen_pairs <- temp[, .SD[sample(.N, 5, replace = FALSE, prob = prop)], by = a] chosen_pairs }
My approach:
myFunction <- function(x, y, proportion){ temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL], a, c) temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)] chosen_pairs <- temp[, sample(.I, 5, replace = FALSE, prob = prop), by = a] indexes <- chosen_pairs[[2]] temp[indexes] } require(rbenchmark) benchmark(myFunction(x, y, proportion), goreF(x, y, proportion), replications = 1, columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")) test replications elapsed relative user.self sys.self 2 goreF(x, y, proportion) 1 19.83 21.323 19.35 0.13 1 myFunction(x, y, proportion) 1 0.93 1.000 0.86 0.08
Perhaps more improvements can be found, I will update if I find them. The first two operations seem too complicated, perhaps they can be shortened, but since I did not see that they affect the calculation timings, I did not rewrite them.
Update:
As pointed out in the question I mentioned at the beginning, you may run into problems with myFunction if your groups contain only one element. So I changed it based on the comments of this post.
myFunction2 <- function(x, y, proportion){ temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL], a, c) temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)] indexes <- temp[, .I[sample(.N, 5, replace = T, prob = prop)], by = a] indexes <- indexes[[2]] temp[indexes] } benchmark(myFunction(x, y, proportion), myFunction2(x, y, proportion), replications = 5, columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")) test replications elapsed relative user.self sys.self 1 myFunction(x, y, proportion) 5 6.61 1.064 6.23 0.36 2 myFunction2(x, y, proportion) 5 6.21 1.000 5.71 0.26
We can see an improvement in top speed.