Efficiently generate random graphs with a user-defined global clustering coefficient

I am working on a simulation of large-scale neural networks, for which I need to generate random graphs that represent the network topology.

I would like to be able to specify the following properties of these graphs:

  • The number of nodes, N (≈ 1000-10000)
  • The average probability of a connection between any two given nodes, p (≈ 0.01-0.2)
  • The global clustering coefficient, C (≈ 0.1-0.5)

Ideally, the random graphs should be sampled uniformly from the set of all possible graphs that satisfy these user-defined criteria.

Right now I am using a very crude random diffusion approach, where I start with an Erdos-Renyi random network of the desired size and global connection probability, then at each step I randomly rewire some fraction of the edges. If the rewiring brings me closer to the desired C, I keep the rewired network for the next iteration.

Here is my current Python implementation:

    import igraph
    import numpy as np

    def generate_fixed_gcc(n, p, target_gcc, tol=1E-3):
        """
        Creates an Erdos-Renyi random graph of size n with a specified global
        connection probability p, which is then iteratively rewired in order
        to achieve a user-specified global clustering coefficient.
        """
        # initialize random graph
        G_best = igraph.Graph.Erdos_Renyi(n=n, p=p, directed=True, loops=False)

        loss_best = 1.
        n_edges = G_best.ecount()

        # start with a high rewiring rate
        rewiring_rate = n_edges
        n_iter = 0

        while loss_best > tol:

            # operate on a copy of the current best graph
            G = G_best.copy()

            # adjust the number of connections to rewire according to the
            # current best loss
            n_rewire = min(max(int(rewiring_rate * loss_best), 1), n_edges)
            G.rewire(n=n_rewire)

            # compute the global clustering coefficient
            gcc = G.transitivity_undirected()
            loss = abs(gcc - target_gcc)

            # did we improve?
            if loss < loss_best:

                # keep the new graph
                G_best = G
                loss_best = loss
                gcc_best = gcc

                # increase the rewiring rate
                rewiring_rate *= 1.1

            else:

                # reduce the rewiring rate
                rewiring_rate *= 0.9

            n_iter += 1

        # get adjacency matrix as a boolean numpy array
        M = np.array(G_best.get_adjacency().data, dtype=bool)

        return M, n_iter, gcc_best
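For reference, a typical call looks like this (the parameter values here are arbitrary):

    # generate a 200-node graph with p = 0.1 and a target GCC of 0.3
    M, n_iter, gcc = generate_fixed_gcc(200, 0.1, 0.3)
    print("GCC = %.4f after %d iterations" % (gcc, n_iter))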

This works fine for small networks (N < 500), but it quickly becomes intractable as the number of nodes increases. It takes about 20 seconds to generate a 200-node graph, and several days to generate a 1000-node graph.

Can anyone suggest an efficient way to do this?

2 answers

Having done a bit of reading, it seems that the best solution might be a generalized version of the Gleeson algorithm presented in this article. However, I still don't understand how to implement it, so for the time being I'm working with the algorithm of Bansal et al.

Like my naive approach, this is a Markov chain method that uses random edge swaps, but unlike mine it specifically targets "triplet motifs" in the graph for rewiring:

[figure: the triplet motif targeted by the rewiring step]

Since this move has a greater tendency to introduce triangles, it has a greater effect on the clustering coefficient. At least in the undirected case, the rewiring step is also guaranteed to preserve the degree sequence. Again, at each rewiring iteration the new global clustering coefficient is measured, and the new graph is accepted if it brings the GCC closer to the target value.
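Here is a minimal sketch of what a single swap attempt might look like in the undirected case (this is just an illustration of the idea, not Bansal et al.'s actual implementation; the motif search is deliberately naive):

    import random
    import igraph

    def triplet_swap(G):
        """Attempt one degree-preserving swap that closes the triangle
        (x, y1, y2). Returns True if the swap was performed."""
        # pick a random 'centre' node x with at least two neighbours
        x = random.randrange(G.vcount())
        nbrs = G.neighbors(x)
        if len(nbrs) < 2:
            return False
        y1, y2 = random.sample(nbrs, 2)
        if G.are_connected(y1, y2):
            return False  # the triangle is already closed
        # choose the 'outer' edges (y1, z1) and (y2, z2) to swap out
        z1_choices = [z for z in G.neighbors(y1) if z not in (x, y2)]
        z2_choices = [z for z in G.neighbors(y2) if z not in (x, y1)]
        if not z1_choices or not z2_choices:
            return False
        z1 = random.choice(z1_choices)
        z2 = random.choice(z2_choices)
        if z1 == z2 or G.are_connected(z1, z2):
            return False  # would create a self-loop or a multi-edge
        # the swap: all four degrees are left unchanged
        G.delete_edges([(y1, z1), (y2, z2)])
        G.add_edges([(y1, y2), (z1, z2)])
        return True

Each accepted swap closes the triangle (x, y1, y2), while the new edge (z1, z2) keeps the degree sequence intact; the accept/reject step against the target GCC then works exactly as in the diffusion method.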

Bansal et al. actually provide a Python implementation, but for various reasons I ended up writing my own version, which you can find here.

Performance

The Bansal approach takes just over half the number of iterations and half the total time compared to my naive diffusion method:

[figure: benchmark of iterations and total time, diffusion method vs. Bansal method]

I was hoping for a bigger gain, but a 2x speedup is better than nothing.

Directed Graph Generalization

One remaining problem with the Bansal method is that my graphs are directed, whereas the algorithm of Bansal et al. is only designed to work with undirected graphs. With a directed graph, the rewiring step is no longer guaranteed to preserve the in- and out-degree sequences.


Update

I just figured out how to generalize the Bansal method so that it preserves both the in-degree and out-degree sequences for directed graphs. The trick is to choose motifs where the two outer edges to be swapped have opposite directions (the directions of the edges between {x, y1} and {x, y2} don't matter):

[figure: the directed triplet motif; the two outer edges to be swapped point in opposite directions]
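A sketch of the corresponding directed swap (again my own illustration, with a deliberately naive motif search; the essential point is only the direction check on the two outer edges):

    import random
    import igraph

    def directed_triplet_swap(G):
        """Attempt one swap on a directed graph. The two outer edges point
        in opposite directions (y1 -> z1 leaves the motif, z2 -> y2 enters
        it), so every in- and out-degree is preserved."""
        x = random.randrange(G.vcount())
        nbrs = list(set(G.neighbors(x, mode="all")))  # directions around x don't matter
        if len(nbrs) < 2:
            return False
        y1, y2 = random.sample(nbrs, 2)
        # outer edge 1 leaves y1; outer edge 2 enters y2
        z1_choices = [z for z in G.successors(y1) if z not in (x, y1, y2)]
        z2_choices = [z for z in G.predecessors(y2) if z not in (x, y1, y2)]
        if not z1_choices or not z2_choices:
            return False
        z1 = random.choice(z1_choices)
        z2 = random.choice(z2_choices)
        if z1 == z2:
            return False
        if G.are_connected(y1, y2) or G.are_connected(z2, z1):
            return False  # would create a multi-edge
        # swap y1->z1 and z2->y2 for y1->y2 and z2->z1:
        # y1 keeps its out-degree, y2 its in-degree,
        # z1 its in-degree, z2 its out-degree
        G.delete_edges([(y1, z1), (z2, y2)])
        G.add_edges([(y1, y2), (z2, z1)])
        return True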

I also made several other optimizations, and performance is starting to look a little more respectable: the method takes about half the number of iterations and half the total time of the diffusion approach. I've updated the charts above with the new timings.


You're right, this is a very expensive way to achieve what you want. I can only speculate about whether there is a mathematically sound way to optimize it that also guarantees something close to a uniform distribution. I'm not even sure that your method produces a uniform distribution, although it seems to. Let me try:

Based on the docs for transitivity_undirected and the wikipedia article on the Clustering Coefficient, it seems that you can make local changes to the graph while knowing their exact effect on the global connectivity and the global clustering coefficient.

The global clustering coefficient is based on triplets of nodes. A triplet consists of three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle consists of three closed triplets, one centred on each of the nodes. The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed).
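As a quick sanity check of that definition (nothing specific to the method below, and the parameters are arbitrary), the identity can be verified with igraph:

    import igraph

    # C = 3 * triangles / (open + closed triplets)
    G = igraph.Graph.Erdos_Renyi(n=100, p=0.1)
    n_triangles = len(G.cliques(min=3, max=3))
    # a node of degree d is the centre of d * (d - 1) / 2 triplets
    n_triplets = sum(d * (d - 1) // 2 for d in G.degree())
    C = 3.0 * n_triangles / n_triplets
    print(C, G.transitivity_undirected())  # these should agree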


(*edit*) Based on my reading of the paper referenced by ali_m, the method below will likely spend too many edges on low-degree nodes, producing a graph that cannot reach the desired clustering coefficient unless that target is very low (which probably wouldn't be useful anyway). So, in case anyone actually wants to use this: you will want to identify higher-degree nodes and preferentially add edges between them, so that the clustering coefficient rises quickly without adding many edges.

On the other hand, the method below does broadly follow the methods in the research paper, so it is a more or less reasonable approach.


If I understood correctly, you could do the following:

  • Generate the graph as you do now.

  • Calculate and track:

    • p_surplus to track the number of edges that need to be added or removed elsewhere to keep the connection probability p on target
    • cc_top, cc_btm to track the numerator and denominator of the clustering coefficient
  • Iteratively (but not exhaustively) select random pairs and connect or disconnect them, moving monotonically toward the clustering coefficient (cc) that you want while maintaining the connection probability (p).

    Pseudocode:

     for random_pair in random_pairs:
         if (random_pair is connected) and (need to reduce cc or p):
             # maybe put priority on whichever has the larger gap?
             delete the edge
             p_surplus -= 1
             cc_top -= broken_connected_triplets   # have to search locally
             cc_btm -= (broken_connected_triplets + broken_open_triplets)
         elif (random_pair is not connected) and (need to increase cc or p):
             add the edge
             p_surplus += 1
             cc_top += new_connected_triplets      # have to search locally
             cc_btm += (new_connected_triplets + new_open_triplets)
         if cc and p are within the desired ranges:
             done
         if some condition for detecting infinite loops:
             rethink this method
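A rough Python translation of that pseudocode (my sketch only: plain adjacency sets, with cc_top/cc_btm folded into triangle and triplet counts, and no attempt at uniform sampling):

    import random

    def incremental_generate(n, p_target, cc_target,
                             p_tol=1e-3, cc_tol=1e-3, max_iter=10**7):
        adj = [set() for _ in range(n)]   # undirected, no self-loops
        n_pairs = n * (n - 1) // 2
        n_edges = 0
        n_triangles = 0  # cc_top is 3 * n_triangles
        n_triplets = 0   # cc_btm: connected triplets, open + closed

        for _ in range(max_iter):
            p = n_edges / n_pairs
            cc = 3.0 * n_triangles / n_triplets if n_triplets else 0.0
            if abs(p - p_target) < p_tol and abs(cc - cc_target) < cc_tol:
                return adj

            u, v = random.sample(range(n), 2)
            # triangles through the pair (u, v): a purely local search
            common = len(adj[u] & adj[v])

            if v in adj[u] and (cc > cc_target or p > p_target):
                # delete the edge, updating the counts locally
                n_triplets -= (len(adj[u]) - 1) + (len(adj[v]) - 1)
                n_triangles -= common
                adj[u].remove(v)
                adj[v].remove(u)
                n_edges -= 1
            elif v not in adj[u] and (cc < cc_target or p < p_target):
                # add the edge
                n_triplets += len(adj[u]) + len(adj[v])
                n_triangles += common
                adj[u].add(v)
                adj[v].add(u)
                n_edges += 1

        raise RuntimeError("no convergence -- rethink this method")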

This may not be entirely correct, but I think the approach will work. Searching for triplets locally, and always moving both of your parameters in the right direction, will be better than copying the graph and measuring the global cc so many times.

