How to vectorize or otherwise speed up this loop logic in R?

Long-time lurker, first-time asker.

I am trying to calculate the "items in common between pairs of item sets" for a 20M+ item dataset. An example of the data is as follows.

    #serially numbered items
    parents <- rep(1:10000)
    #generate rnorm # of children items
    numchild <- round(rnorm(10000, mean=30, sd=10))
    #fill the parent-child list
    parent_child <- list()
    for (x in 1:length(parents)){
      if (numchild[x]>0){
        f1 <- sample(1:length(parents), size=numchild[x])
        f2 <- list(parents[f1])
        parent_child <- c(parent_child, f2)
      } else {
        parent_child <- c(parent_child, list(x+1)) #if numchild=0, make up something
      }
    }

Here is what I want to do: say parent element #1 has 5 children (1, 2, 3, 4, 5) and parent element #2 has 3 children (4, 10, 22).

I want to calculate the length of the intersection for every (parent_i, parent_j) combination. In the case above that is one common element: 4.
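
To make that concrete, here is a minimal sketch of that single comparison (the two vectors are just the example children above):

    # the example parents' child sets
    p1 <- c(1, 2, 3, 4, 5)
    p2 <- c(4, 10, 22)

    # size of the overlap between the two child sets
    length(intersect(p1, p2))
    ## [1] 1   (the shared child is 4)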

I do this for 10M+ parents, which on average have 15-20 children each, with a range of (0, 100). So the item-item matrix is 10M x 10M in size.

I have a foreach loop that works on a smaller subset but does not scale to the complete data set (on a 64-core machine with 256 GB of RAM). With the loop below I already calculate only half of the matrix, since (parent_i, parent_j) is the same as (parent_j, parent_i) for this purpose.

    #small subset
    a <- parent_child[1:1000]
    outerresults <- foreach (i = 1:(length(a)), .combine=rbind,
                             .packages=c('foreach','doParallel')) %dopar% {
      b <- a[[i]]
      rest <- a[i+1:length(a)]
      foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
        common <- length(intersect(b, rest[[j]]))
        if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
      }
    }

I experimented with other options (using Reduce, saving the parent-child data to a data frame, etc.), but didn't have much luck.

Is there any way to make this scale?

Tags: vectorization, loops, foreach, r
3 answers

I reversed the relationship, so that we have a child-to-parent mapping

    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len),
                          unlist(parent_child, use.names=FALSE))
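
To see what this inversion produces, here is a tiny hand-made example (the three parents and their children are made up purely for illustration):

    # toy parent -> children list
    toy <- list(c(4, 7), c(4, 9), c(7, 9))
    len <- sapply(toy, length)
    split(rep(seq_along(toy), len), unlist(toy, use.names=FALSE))
    ## $`4`
    ## [1] 1 2
    ## $`7`
    ## [1] 1 3
    ## $`9`
    ## [1] 2 3

So child 4 is shared by parents 1 and 2, child 7 by parents 1 and 3, and so on.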

Something like the following then constructs, for every child, a string identifying each pair of parents that share it

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], function(x) {
      x <- combn(sort(x), 2)
      paste(x[1,], x[2,], sep=".")
    })

and then counting the occurrences of each pair

 table(unlist(int, use.names=FALSE)) 

or a little faster

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
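
This is essentially a hand-rolled table(); on a made-up vector of "parent.parent" labels it works like this:

    # hypothetical pair labels, one per (child, parent-pair) occurrence
    xx  <- c("1.2", "1.3", "1.2", "2.3", "1.2")
    nms <- unique(xx)          # distinct pairs
    cnt <- match(xx, nms)      # integer code for each occurrence
    setNames(tabulate(cnt, length(nms)), nms)
    ## 1.2 1.3 2.3
    ##   3   1   1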

Putting this together into a function:

    f1 <- function(parent_child) {
      len <- sapply(parent_child, length)
      child_parent <- split(rep(seq_along(parent_child), len),
                            unlist(parent_child, use.names=FALSE))
      keep <- sapply(child_parent, length) > 1
      int <- lapply(child_parent[keep], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
      })
      xx <- unlist(int, use.names=FALSE)
      nms <- unique(xx)
      cnt <- match(xx, nms)
      setNames(tabulate(cnt, length(nms)), nms)
    }

This gives (for all 10,000 parent-child elements):

    > system.time(ans1 <- f1(parent_child))
       user  system elapsed
     14.625   0.012  14.668
    > head(ans1)
    542.1611 542.1832 542.2135 542.2435 542.2527 542.2806
           1        1        1        1        1        1

I am not sure this will really scale to the size of problem you are talking about, though, since it is polynomial in the number of parents per child.
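
To make that caveat concrete: a child shared by k parents contributes choose(k, 2) = k*(k-1)/2 pairs, so the total number of pairs emitted (before the counting step) can be estimated directly from the child_parent lengths; the snippet below is only an illustration.

    # pairs generated by a single child shared by k parents
    choose(20, 2)     # 190
    choose(100, 2)    # 4950

    # total pairs emitted across all children, before duplicates are counted
    sum(choose(sapply(child_parent, length), 2))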

One possible speed-up is to memoize the combinatorial calculation, using the length of the argument as the "key" and storing the combination as the "value". This reduces the number of times combn is called to the number of unique lengths among the elements of child_parent.

    combn1 <- local({
      memo <- new.env(parent=emptyenv())
      function(x) {
        key <- as.character(length(x))
        if (!exists(key, memo))
          memo[[key]] <- t(combn(length(x), 2))
        paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
      }
    })

    f2 <- function(parent_child) {
      len <- sapply(parent_child, length)
      child_parent <- split(rep(seq_along(parent_child), len),
                            unlist(parent_child, use.names=FALSE))
      keep <- sapply(child_parent, length) > 1
      int <- lapply(child_parent[keep], combn1)
      xx <- unlist(int, use.names=FALSE)
      nms <- unique(xx)
      cnt <- match(xx, nms)
      setNames(tabulate(cnt, length(nms)), nms)
    }

which helps quite a bit

    > system.time(ans2 <- f2(parent_child))
      user  system elapsed
     5.337   0.000   5.347
    > identical(ans1, ans2)
    [1] TRUE

The slow part is now paste:

    > Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
    $by.self
                     self.time self.pct total.time total.pct
    "paste"               3.92    73.41       3.92     73.41
    "match"               0.74    13.86       0.74     13.86
    "unique.default"      0.40     7.49       0.40      7.49
    "as.character"        0.08     1.50       0.08      1.50
    "unlist"              0.08     1.50       0.08      1.50
    "combn"               0.06     1.12       0.06      1.12
    "lapply"              0.02     0.37       4.00     74.91
    "any"                 0.02     0.37       0.02      0.37
    "setNames"            0.02     0.37       0.02      0.37

    $by.total
    ...

We can avoid that by encoding the two parents sharing a child into a single number; because of the way floating-point numbers are represented in R, this is exact up to about 2^21.

    encode <- function(x, y, n) (x - 1) * (n + 1) + y
    decode <- function(z, n) list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
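
As a quick sanity check, here is a round trip with made-up values (this reuses the encode()/decode() just defined; n is the number of parents):

    n <- 10000
    z <- encode(3, 42, n)    # the pair (parent 3, parent 42) as a single number
    z
    ## [1] 20044
    decode(z, n)
    ## $x
    ## [1] 3
    ##
    ## $y
    ## [1] 42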

Adjusting our combn1 and f2 functions to use this encoding:

    combn2 <- local({
      memo <- new.env(parent=emptyenv())
      function(x, encode_n) {
        key <- as.character(length(x))
        if (!exists(key, memo))
          memo[[key]] <- t(combn(length(x), 2))
        encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
      }
    })

    f3 <- function(parent_child) {
      encode_n <- length(parent_child)
      len <- sapply(parent_child, length)
      child_parent <- unname(split(rep(seq_along(parent_child), len),
                                   unlist(parent_child, use.names=FALSE)))
      keep <- sapply(child_parent, length) > 1
      int <- lapply(child_parent[keep], combn2, encode_n)
      id <- unlist(int, use.names=FALSE)
      uid <- unique(id)
      n <- tabulate(match(id, uid), length(uid))
      do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
    }

leading to

    > system.time(f3(parent_child))
      user  system elapsed
     2.140   0.000   2.146

This compares very favorably (note that the timing in the previous line is for all 10,000 parent-child relations) with jlhoward's revised answer:

    > system.time(result.3 <- do.call("rbind", lapply(1:99, gg)))
      user  system elapsed
     2.465   0.000   2.468
    > system.time(f3(parent_child[1:99]))
      user  system elapsed
     0.016   0.000   0.014

and it scales much more sensibly.

For what it's worth, the data generation routine sits in the second circle of Patrick Burns' R Inferno, using the "copy-and-append" pattern rather than pre-allocating the space and filling it in. Avoid this by writing the body of the for loop as a function and using lapply. Avoid the need for the complicated conditional inside the for loop by fixing the problem up front:

    numchild <- round(rnorm(10000, mean=30, sd=10))
    numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))

or by sampling from a distribution (rpois, rbinom) that generates positive integer values. The data generation then becomes:

    n_parents <- 10000
    numchild <- round(rnorm(n_parents, mean=30, sd=10))
    numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
    parent_child <- lapply(numchild, sample, x=n_parents)

Here is another approach that is about 10X faster than my previous answer, and 17X faster than the original code (and also simpler):

    ff <- function(u2, u1, a) {
      common <- length(intersect(a, parent_child[[u2]]))
      if (common > 0) {return(data.frame(u1, u2, common))}
    }

    gg <- function(u1) {
      a <- parent_child[[u1]]
      do.call("rbind", lapply((u1+1):100, ff, u1, a))
    }

    system.time(result.3 <- do.call("rbind", lapply(1:99, gg)))
       user  system elapsed
       1.04    0.00    1.03

result.3 is identical to result.2 from the previous answer:

    max(abs(result.3 - result.2))
    [1] 0

Well, a slight improvement (I think):

Original code (wrapped in a function call):

    f = function(n) {
      #small subset
      a <- parent_child[1:n]
      outerresults <- foreach (i = 1:(length(a)), .combine=rbind,
                               .packages=c('foreach','doParallel')) %dopar% {
        b <- a[[i]]
        rest <- a[i+1:length(a)]
        foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
          common <- length(intersect(b, rest[[j]]))
          if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
        }
      }
      return(outerresults)
    }

Modified Code:

    g <- function(n) {
      a <- parent_child[1:n]
      outerresults <- foreach (i = 1:n, .combine=rbind,
                               .packages=c('foreach','doParallel')) %dopar% {
        b <- a[[i]]
        foreach (j = (i):n, .combine=rbind) %dopar% {
          if (i != j) {
            c <- a[[j]]
            common <- length(intersect(b, c))
            if (common > 0) {g <- data.frame(u1=i, u2=j, common)}
          }
        }
      }
      return(outerresults)
    }

Benchmarks:

    system.time(result.old <- f(100))
       user  system elapsed
      17.21    0.00   17.33

    system.time(result.new <- g(100))
       user  system elapsed
      10.42    0.00   10.47

The numbering for u2 differs slightly between the two approaches, but both produce the same vector of common counts:

    max(abs(result.old$common - result.new$common))
    [1] 0

I tried this using data.table operations in place of intersect(...), and it was actually much slower (!!).

