I undid the split, so that we have a child-to-parent relationship
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len),
                          unlist(parent_child, use.names=FALSE))
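To see what this does, here is a toy example (the data and the `_toy` names are made up for illustration): parent 1 has children 1 and 2, while parents 2 and 3 each have children 2 and 3.

    pc_toy <- list(c(1, 2), c(2, 3), c(2, 3))   # made-up toy parent_child
    len <- sapply(pc_toy, length)
    cp_toy <- split(rep(seq_along(pc_toy), len),
                    unlist(pc_toy, use.names=FALSE))
    ## cp_toy: child 1 -> parent 1; child 2 -> parents 1, 2, 3;
    ##         child 3 -> parents 2, 3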
Something like the following constructs, for each child, a string identifying each pair of parents sharing that child
    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
    })
and counts the pairs
    table(unlist(int, use.names=FALSE))
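Continuing the toy example from above, the pair labels and their counts come out as

    keep_toy <- sapply(cp_toy, length) > 1
    int_toy <- lapply(cp_toy[keep_toy], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
    })
    table(unlist(int_toy, use.names=FALSE))
    ## 1.2 1.3 2.3 
    ##   1   1   2 

i.e., parents 2 and 3 share two children, and the other pairs share one child each.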
or a little faster
    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
Putting the pieces together as a function
    f1 <- function(parent_child) {
        len <- sapply(parent_child, length)
        child_parent <- split(rep(seq_along(parent_child), len),
                              unlist(parent_child, use.names=FALSE))
        keep <- sapply(child_parent, length) > 1
        int <- lapply(child_parent[keep], function(x) {
            x <- combn(sort(x), 2)
            paste(x[1,], x[2,], sep=".")
        })
        xx <- unlist(int, use.names=FALSE)
        nms <- unique(xx)
        cnt <- match(xx, nms)
        setNames(tabulate(cnt, length(nms)), nms)
    }
we get (this is for all 10,000 parent-child elements)
    > system.time(ans1 <- f1(parent_child))
       user  system elapsed 
     14.625   0.012  14.668 
    > head(ans1)
    542.1611 542.1832 542.2135 542.2435 542.2527 542.2806 
           1        1        1        1        1        1 
I'm not sure this will really scale to the size of problem you're talking about, although the work is polynomial in the number of parents per child.
One possible speed-up is to "memoize" the combinatorial calculation, using the length of the argument as the "key" and storing the combinations as the "value". This reduces the number of calls to combn to the number of unique lengths among the child_parent elements.
    combn1 <- local({
        memo <- new.env(parent=emptyenv())
        function(x) {
            key <- as.character(length(x))
            if (!exists(key, memo))
                memo[[key]] <- t(combn(length(x), 2))
            paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
        }
    })

    f2 <- function(parent_child) {
        len <- sapply(parent_child, length)
        child_parent <- split(rep(seq_along(parent_child), len),
                              unlist(parent_child, use.names=FALSE))
        keep <- sapply(child_parent, length) > 1
        int <- lapply(child_parent[keep], combn1)
        xx <- unlist(int, use.names=FALSE)
        nms <- unique(xx)
        cnt <- match(xx, nms)
        setNames(tabulate(cnt, length(nms)), nms)
    }
which helps quite a bit
    > system.time(ans2 <- f2(parent_child))
       user  system elapsed 
      5.337   0.000   5.347 
    > identical(ans1, ans2)
    [1] TRUE
The slow part is now paste
    > Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
    $by.self
                     self.time self.pct total.time total.pct
    "paste"               3.92    73.41       3.92     73.41
    "match"               0.74    13.86       0.74     13.86
    "unique.default"      0.40     7.49       0.40      7.49
    "as.character"        0.08     1.50       0.08      1.50
    "unlist"              0.08     1.50       0.08      1.50
    "combn"               0.06     1.12       0.06      1.12
    "lapply"              0.02     0.37       4.00     74.91
    "any"                 0.02     0.37       0.02      0.37
    "setNames"            0.02     0.37       0.02      0.37

    $by.total
    ...
We can avoid this by encoding each pair of parents sharing a child into a single integer; because of the way floating point numbers are represented in R, this is exact for up to about 2^21 parents
    encode <- function(x, y, n) (x - 1) * (n + 1) + y
    decode <- function(z, n)
        list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
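A quick sanity check (the indices 123 and 4567 are arbitrary) that the encoding round-trips:

    n <- 10000                   # number of parents
    z <- encode(123, 4567, n)
    decode(z, n)
    ## $x
    ## [1] 123
    ##
    ## $y
    ## [1] 4567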
Adjusting our combn1 and f2 functions accordingly gives
    combn2 <- local({
        memo <- new.env(parent=emptyenv())
        function(x, encode_n) {
            key <- as.character(length(x))
            if (!exists(key, memo))
                memo[[key]] <- t(combn(length(x), 2))
            encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
        }
    })

    f3 <- function(parent_child) {
        encode_n <- length(parent_child)
        len <- sapply(parent_child, length)
        child_parent <-
            unname(split(rep(seq_along(parent_child), len),
                         unlist(parent_child, use.names=FALSE)))
        keep <- sapply(child_parent, length) > 1
        int <- lapply(child_parent[keep], combn2, encode_n)
        id <- unlist(int, use.names=FALSE)
        uid <- unique(id)
        n <- tabulate(match(id, uid), length(uid))
        do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
    }
leading to
    > system.time(f3(parent_child))
       user  system elapsed 
      2.140   0.000   2.146 
This compares very favorably (note that the timing in the previous line is for all 10,000 parent-child relations) with jlhoward's revised answer
    > system.time(result.3 <- do.call("rbind", lapply(1:99, gg)))
       user  system elapsed 
      2.465   0.000   2.468 
    > system.time(f3(parent_child[1:99]))
       user  system elapsed 
      0.016   0.000   0.014 
and scales much better.
For what it's worth, the data generation routine is an example of the second circle of Patrick Burns's The R Inferno, growing the result by copy-and-append instead of pre-allocating the space and filling it in. Avoid this by writing the body of the for loop as a function and using lapply (a sketch follows the data generation code below). Avoid the need for the complicated conditional inside the for loop by fixing the problem before generating the data
    numchild <- round(rnorm(10000, mean=30, sd=10))
    numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
or by drawing from a distribution (rpois, rbinom) that generates only positive integer values. The data generation is then
    n_parents <- 10000
    numchild <- round(rnorm(n_parents, mean=30, sd=10))
    numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
    parent_child <- lapply(numchild, sample, x=n_parents)
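For comparison, here is a sketch (with a made-up result name, res) of the copy-and-append pattern the R Inferno warns against, next to the lapply form used above; the lapply version does the same work without repeatedly copying the growing list.

    ## copy-and-append: each iteration copies all of 'res' to add one element
    res <- list()
    for (i in seq_along(numchild))
        res <- c(res, list(sample(n_parents, numchild[i])))

    ## equivalent lapply form, as used above
    parent_child <- lapply(numchild, sample, x=n_parents)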