Attach data.table using fetch

Question

Attach data.table using fetch

I have several large datasets that I am trying to combine. I created a toy example of what I want to do. I have three tables:

require(data.table) set.seed(151) x <- data.table(a=1:100000) y <- data.table(b=letters[1:20],c=sample(LETTERS[1:4])) proportion <- data.table(expand.grid(a=1:100000,c=LETTERS[1:4])) proportion[,prop:=rgamma(4,shape = 1),by=a] proportion[,prop:=prop/sum(prop),by=a]

Three tables: x , y and proportion . For each element in x I want to select from the entire table y using the probabilities from the proportion table and combine them into another table. The method I came up with is:

 temp <- setkey(setkey(x[,c(k=1,.SD)],k)[y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL],a,c) temp <- temp[setkey(proportion,a,c)][,prop:=prop/.N,by=.(a,c)] # Uniform distribution within the same 'c' column group chosen_pairs <- temp[,.SD[sample(.N,5,replace=FALSE,prob = prop)],by=a]

But this method is intense and slow as it joins two tables first, and then a sample from it. Is there a way to accomplish this task in an efficient (memory and time) way?

+8

r data.table

A gore May 19, '17 at 20:52

source share

1 answer

minem · Answer 1 · 2017-06-02T06:59:34+0000

I ran into some similar problem in this question. I put your solution in a function for better comparison:

 goreF <- function(x,y,proportion){ temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL], a, c) temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)] chosen_pairs <- temp[, .SD[sample(.N, 5, replace = FALSE, prob = prop)], by = a] chosen_pairs }

My approach:

 myFunction <- function(x, y, proportion){ temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL], a, c) temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)] chosen_pairs <- temp[, sample(.I, 5, replace = FALSE, prob = prop), by = a] indexes <- chosen_pairs[[2]] temp[indexes] } require(rbenchmark) benchmark(myFunction(x, y, proportion), goreF(x, y, proportion), replications = 1, columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")) test replications elapsed relative user.self sys.self 2 goreF(x, y, proportion) 1 19.83 21.323 19.35 0.13 1 myFunction(x, y, proportion) 1 0.93 1.000 0.86 0.08

Perhaps more improvements can be found, I will update if I find them. The first two operations seem too complicated, perhaps they can be shortened, but since I did not see that they affect the calculation timings, I did not rewrite them.

Update:

As pointed out in the question I mentioned at the beginning, you may run into problems with myFunction if your groups contain only one element. So I changed it based on the comments of this post.

 myFunction2 <- function(x, y, proportion){ temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL], a, c) temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)] indexes <- temp[, .I[sample(.N, 5, replace = T, prob = prop)], by = a] indexes <- indexes[[2]] temp[indexes] } benchmark(myFunction(x, y, proportion), myFunction2(x, y, proportion), replications = 5, columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")) test replications elapsed relative user.self sys.self 1 myFunction(x, y, proportion) 5 6.61 1.064 6.23 0.36 2 myFunction2(x, y, proportion) 5 6.21 1.000 5.71 0.26

We can see an improvement in top speed.

Attach data.table using fetch

Update:

More articles: