Optimize the graph plotting algorithm based on node weights

I am trying to improve the function of building a network based on an estimate calculated from some node attributes. The function tries to find the best subnet from the graph maximizing the product of node attributes.

The function runs in a random node and starts searching in the first socket, if there are some neighbors whose node index is sufficient for the threshold, the / s neighbor is added to the first node, and the process continues until more is added (adding a neighbor will not give the desired gain in the account) . If there is no node in the first neighbors, which gives an increase in the estimate, then the function looks at the neighbors of the second degree. In this situation, it is very likely that there are several ways to connect a node (a second-degree neighbor), in this particular case, the selected path will be the shortest with the highest weight (one of the node attributes).

I could do code parallelization, although I don’t know how to implement it in this type of function.

The function is as follows:

build_network <- function (G, seed, d= 2){ net <- G d <- d score.fun<-function(g){ Za <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2)) k <- vcount(g) tmp <- genesets.length.null.stat[[as.character(k)]] # genesets.length.null.stat is a list with the median of Za and sd of Za calculated for 1000 replicates of networks of size k Sa <- (Za-tmp[1])/tmp[2] } best.fun<-function(in.nodes,out.nodes) { score<-(-Inf); best<-character() for(node in out.nodes){ subG.update<-induced.subgraph(net, c(in.nodes,node)) if( score.fun(subG.update) > score ){ score<-score.fun(subG.update) best<-node } } list("node"=best,"score"=score) } subG <- induced.subgraph(net, seed) if (!is.connected(subG)) { #the seed must be connected stop("Input seeds are disjoint") } while (TRUE) { in.nodes <- V(subG)$name node_num <- vcount(subG) subsum <- score.fun(subG) #subx <- V(subG)$name for (rad in 1:d) { tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name)) pot.nodes <- V(net)[tmp.neigh]$name out.nodes <- setdiff(pot.nodes, in.nodes) if (length(out.nodes) == 0) break best_node<-best.fun(in.nodes, out.nodes) new_score<-best_node$score best_node<-best_node$node if (new_score > subsum + 0.01) { tmp <- unlist(lapply(best_node, function(x) node2treePath(net,V(subG)$name, x))) # node2treePath is a function to retrieve the shortest path with the highest node weights in.nodes <- c(tmp, V(subG)$name) subG <- induced.subgraph(net, in.nodes) break } } if (node_num == vcount(subG)) break } return(subG) } 

I am trying to apply this function to a graph of ~ 10,000 nodes. There will be an approximation of the code to run the function

 ### generate some example data library(igraph) my_graph <- erdos.renyi.game(10000, 0.0003) V(my_graph)$name <- 1:vcount(my_graph) V(my_graph)$weight <- rnorm(10000) V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05) ### Run the function sublist = list() for (node in V(G)$name) { subnet <- build_network(G, node, d) sublist[[node]] <- subnet } 

EDIT: here dput of head(genesets.length.null.stat)

 structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6")) 

Here is the node2treePath function:

 node2treePath <- function (G, Tnodes, node){ tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res tmp.l <- unlist(lapply(tmp.path, length)) index <- which(tmp.l == min(tmp.l)) tmp.path = tmp.path[index] tmp.sum <- unlist(lapply(tmp.path, function(x)return(sum(V(G)[x]$weight)))) index <- which(tmp.sum == max(tmp.sum)) selected.path = tmp.path[index] collect <- unlist(lapply(selected.path, function(x)return(V(G)[x]$name))) return(collect) } 
+7
algorithm r igraph
source share
3 answers

For the logic you want to do (and I suppose you can change it in a way that is incompatible with the answers above), the following code is about ten times 30% faster. I used Rprof and profr and transcoded some slow bits in trivial ways, for example. not passing a couple with a named list, but just an anonymous couple from one of your functions. A numerically named list with value pairs for genesets.length.null.stat very inefficient. I replaced it with two number vectors. You also call the "V" function a lot, which has been a long-time consumer: as you can see, you can call it once and then query for the result as needed.

 # node2treePath is a function to retrieve the shortest path with the highest node weights node2treePath_jw <- function(G, Tnodes, node){ tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res tmp.l <- vapply(tmp.path, length, integer(1)) index <- which(tmp.l == min(tmp.l)) tmp.path = tmp.path[index] Vg <- V(G) tmp.sum <- vapply(tmp.path, function(x) sum(Vg[x]$weight), numeric(1)) index <- which(tmp.sum == max(tmp.sum)) selected.path = tmp.path[index] sapply(selected.path, function(x) Vg[x]$name) } build_network_jw <- function(net, seed, d= 2){ score.fun <- function(Vg, k){ Za <- sum(Vg$weight * Vg$RWRNodeweight) / sqrt(sum(Vg$RWRNodeweight^2)) (Za - genesets_jack_a[k]) / genesets_jack_b[k] } best.fun_jw <- function(in.nodes, out.nodes) { score <- (-Inf) best <- character() for (node in out.nodes) { subG.update <- induced.subgraph(net, c(in.nodes,node)) Vsgu <- V(subG.update) Vsgu_count <- vcount(subG.update) sf <- score.fun(Vsgu, Vsgu_count) if (sf > score) { score <- sf best <- node } } list(best, score) } subG <- induced.subgraph(net, seed) if (!is.connected(subG)) { #the seed must be connected stop("Input seeds are disjoint") } while (TRUE) { VsubG <- V(subG) Vnet <- V(net) in.nodes <- VsubG$name node_num <- vcount(subG) subsum <- score.fun(VsubG, node_num) for (rad in 1:d) { # d = 2 tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = VsubG$name)) pot.nodes <- Vnet[tmp.neigh]$name out.nodes <- setdiff(pot.nodes, in.nodes) if (length(out.nodes) == 0) break best_node <- best.fun_jw(in.nodes, out.nodes) new_score <- best_node[[2]] best_node <- best_node[[1]] if (new_score > subsum + 0.01) { tmp <- sapply(best_node, function(x) node2treePath_jw(net, VsubG$name, x)) in.nodes <- c(tmp, VsubG$name) subG <- induced.subgraph(net, in.nodes) break } } if (node_num == vcount(subG)) break } subG } node2treePath <- function (G, Tnodes, node){ tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res tmp.l <- unlist(lapply(tmp.path, length)) index <- which(tmp.l == min(tmp.l)) tmp.path = tmp.path[index] tmp.sum <- unlist(lapply(tmp.path, function(x)return(sum(V(G)[x]$weight)))) index <- which(tmp.sum == max(tmp.sum)) selected.path = tmp.path[index] collect <- unlist(lapply(selected.path, function(x)return(V(G)[x]$name))) return(collect) } build_network <- function (net, seed, d= 2){ #genesets.length.null.stat <- structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6")) genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1)+x, runif(1)+x)) names(genesets.length.null.stat) <- 1:500 score.fun<-function(g){ Za <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2)) k <- vcount(g) tmp <- genesets.length.null.stat[[as.character(k)]] # genesets.length.null.stat is a list with the median of Za and sd of Za calculated for 1000 replicates of networks of size k Sa <- (Za-tmp[1])/tmp[2] } best.fun <- function(in.nodes,out.nodes) { score<-(-Inf); best<-character() for (node in out.nodes){ subG.update<-induced.subgraph(net, c(in.nodes,node)) if (score.fun(subG.update) > score) { score<-score.fun(subG.update) best<-node } } list("node"=best,"score"=score) } subG <- induced.subgraph(net, seed) if (!is.connected(subG)) { #the seed must be connected stop("Input seeds are disjoint") } while (TRUE) { in.nodes <- V(subG)$name node_num <- vcount(subG) subsum <- score.fun(subG) #subx <- V(subG)$name for (rad in 1:d) { tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name)) pot.nodes <- V(net)[tmp.neigh]$name out.nodes <- setdiff(pot.nodes, in.nodes) if (length(out.nodes) == 0) break #message("length in.nodes = ", length(in.nodes)) #message("length out.nodes = ", length(out.nodes)) best_node<-best.fun(in.nodes, out.nodes) new_score<-best_node$score best_node<-best_node$node if (new_score > subsum + 0.01) { tmp <- unlist(lapply(best_node, function(x) node2treePath(net,V(subG)$name, x))) # node2treePath is a function to retrieve the shortest path with the highest node weights in.nodes <- c(tmp, V(subG)$name) subG <- induced.subgraph(net, in.nodes) break } } if (node_num == vcount(subG)) break } subG } library(igraph) library(profr) library(igraph) library(profr) #genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1)+x, runif(1)+x)) #names(genesets.length.null.stat) <- 1:500 set.seed(1) genesets_jack_a = runif(500) + 1:500 genesets_jack_b = runif(500) + 1:500 do_it_jw <- function(n = 1000){ my_graph <- erdos.renyi.game(n, 0.0003) V(my_graph)$name <- 1:vcount(my_graph) V(my_graph)$weight <- rnorm(n) V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05) ### Run the function sublist = list() Vmg <- V(my_graph) for (node in Vmg$name) { #message(node) subnet <- build_network_jw(my_graph, node, 2) sublist[[node]] <- subnet } } do_it <- function(n = 1000){ my_graph <- erdos.renyi.game(n, 0.0003) V(my_graph)$name <- 1:vcount(my_graph) V(my_graph)$weight <- rnorm(n) V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05) ### Run the function sublist = list() Vmg <- V(my_graph) for (node in Vmg$name) { #message(node) subnet <- build_network(my_graph, node, 2) sublist[[node]] <- subnet } } library(microbenchmark) mb <- microbenchmark(do_it(1000), do_it_jw(1000), times = 5) print(mb) 
0
source share

Since your evaluation function depends only on the node attributes, not the edge, the solution is not unique; you can search instead for the best tree. If you restructure your problem so that your nodes are edges and vice versa, you can probably just use the Djikstra algorithm to find the best one. This is already in the igraph package as shortest.paths() .

0
source share

I can’t read the R code, but based on your description: If the evaluation threshold is constant, then this is easy to do in O (| V | + | E | + | C | ^ 2) time, where | C | this is the number of “good” components (this will be explained later).

In the first pass, delete all nodes with an account below the threshold value. Then find all the related components in this new graph (this can be done in O (| V | + | E |) by running DFS in each yet invisible node), calculate their points, multiplying all the vertices of the weight in the component and mark each vertex with your own component identifier. This already tells you about the "good" components - those that do not require any 2nd degree connections.

Suppose this gives | C | Components. Create an empty hash table H that has pairs of component identifiers for keys and pairs (length, weight) for values. Now go back to each vertex v that you deleted in the first pass: for each of them, look at all your neighbors and write down the shortest edge for each individual component (this can be done using the length- | C | array to store the shortest edge to each component observed so far). After examining all the v-neighbors, count the number k of different components that they fall into: if k> = 2, then v should potentially be used to connect some of these k (k-1) / 2 pairs of components. For each pair of different components i and j that v can connect, update H with the weight and distance of this two-way connection as needed: that is, if i and j are not connected yet, then write down that v joins them; otherwise, if they are already connected by some vertex u, update only H if v can do better (i.e. if it uses a shorter overall length and more weight than u). This step can be seen as creating a minimal spanning tree in the "component graph" obtained from the original, cropped graph. The grades for each new “combined” component can easily be calculated as you go by simply multiplying the grades of the two component components.

Finally, simply return the component whose product is maximum.

0
source share

All Articles