Clauset-Newman-Moore Community Implementation

I am trying to implement the community detection algorithm above in Java, and although I have access to the C++ code and the original paper, I cannot get it to work at all. My main problem is that I do not understand the purpose of the code, that is, how the algorithm actually works. In practical terms, my code gets stuck in what looks like an infinite loop in mergeBestQ: the heap seems to grow on every iteration (as I would expect from the code), but the value of topQ always comes back the same.

The graph I am testing with is quite large (300,000 nodes, 650,000 edges). The source code I am basing my implementation on is from the SNAP library ( https://github.com/snap-stanford/snap/blob/master/snap-core/cmty.cpp ). It would be great if someone could explain the intuition behind the algorithm to me. As far as I can tell, it initially puts every node in its own community, records the modularity value (whatever that is) of every pair of connected nodes in the graph, and then finds the pair of nodes with the highest modularity and moves them into the same community. Also, if someone could provide some mid-level pseudocode, that would be great. Here is my implementation so far. I tried to keep it in one file for brevity, but CommunityGraph and CommunityNode live elsewhere (and should not be needed to follow it). The graph maintains a list of all nodes, and each node maintains a list of its connections to other nodes. When run, it never gets past the line while(this.mergeBestQ()){}.

UPDATE: after a more thorough pass over the code I found some errors. It now finishes VERY quickly, but it does not fully implement the algorithm; for example, with 300,000 nodes in the graph it claims there are roughly 299,000 communities (i.e., about one node per community). The updated code is listed below.

    /// Clauset-Newman-Moore community detection method.
    /// At each step, the two communities that contribute the maximum positive value to global modularity are merged.
    /// See: Finding community structure in very large networks, A. Clauset, M.E.J. Newman, C. Moore, 2004
    public class CNMMCommunityMetric implements CommunityMetric {

      private static class DoubleIntInt implements Comparable<DoubleIntInt> {
        public double val1;  // deltaQ of merging communities val2 and val3
        public int val2;
        public int val3;

        DoubleIntInt(double val1, int val2, int val3) {
          this.val1 = val1;
          this.val2 = val2;
          this.val3 = val3;
        }

        @Override
        public int compareTo(DoubleIntInt o) {
          // order by val1 descending, so TreeSet.first() is the largest deltaQ
          if (this.equals(o)) {
            return 0;
          }
          else if (val1 < o.val1
              || (val1 == o.val1 && val2 < o.val2)
              || (val1 == o.val1 && val2 == o.val2 && val3 < o.val3)) {
            return 1;
          }
          else {
            return -1;
          }
        }

        @Override
        public boolean equals(Object o) {
          return this.val2 == ((DoubleIntInt) o).val2 && this.val3 == ((DoubleIntInt) o).val3;
        }

        @Override
        public int hashCode() {
          int hash = 3;
          hash = 79 * hash + this.val2;
          hash = 79 * hash + this.val3;
          return hash;
        }
      }

      private static class CommunityData {
        double DegFrac;                                        // a_i: fraction of edge ends in this community
        TIntDoubleHashMap nodeToQ = new TIntDoubleHashMap();   // neighbour community id -> deltaQ
        int maxQId;

        CommunityData() {
          maxQId = -1;
        }

        CommunityData(double nodeDegFrac, int outDeg) {
          DegFrac = nodeDegFrac;
          maxQId = -1;
        }

        void addQ(int NId, double Q) {
          nodeToQ.put(NId, Q);
          if (maxQId == -1 || nodeToQ.get(maxQId) < Q) {
            maxQId = NId;
          }
        }

        void updateMaxQ() {
          maxQId = -1;
          int[] nodeIDs = nodeToQ.keys();
          double maxQ = nodeToQ.get(maxQId);
          for (int i = 0; i < nodeIDs.length; i++) {
            int id = nodeIDs[i];
            if (maxQId == -1 || maxQ < nodeToQ.get(id)) {
              maxQId = id;
              maxQ = nodeToQ.get(maxQId);
            }
          }
        }

        void delLink(int K) {
          int NId = getMxQNId();
          nodeToQ.remove(K);
          if (NId == K) {
            updateMaxQ();
          }
        }

        int getMxQNId() {
          return maxQId;
        }

        double getMxQ() {
          return nodeToQ.get(maxQId);
        }
      }

      private TIntObjectHashMap<CommunityData> communityData = new TIntObjectHashMap<CommunityData>();
      private TreeSet<DoubleIntInt> heap = new TreeSet<DoubleIntInt>();
      private HashMap<DoubleIntInt, DoubleIntInt> set = new HashMap<DoubleIntInt, DoubleIntInt>();
      private double Q = 0.0;
      private UnionFind uf = new UnionFind();

      @Override
      public double getCommunities(CommunityGraph graph) {
        init(graph);
        // maximize modularity
        while (this.mergeBestQ(graph)) {
        }
        // reconstruct communities from the union-find structure
        Iterator<CommunityNode> ns = graph.getNodes();
        int community = 0;
        TIntIntHashMap communities = new TIntIntHashMap();
        while (ns.hasNext()) {
          CommunityNode n = ns.next();
          int r = uf.find(n);
          if (!communities.contains(r)) {
            communities.put(r, community++);
          }
          n.setCommunity(communities.get(r));
        }
        System.exit(0);
        return this.Q;
      }

      private void init(Graph graph) {
        double M = 0.5 / graph.getEdgesList().size();  // M = 1 / (2m)
        Iterator<Node> ns = graph.getNodes();
        while (ns.hasNext()) {
          Node n = ns.next();
          uf.add(n);
          int edges = n.getEdgesList().size();
          if (edges == 0) {
            continue;
          }
          CommunityData dat = new CommunityData(M * edges, edges);
          communityData.put(n.getId(), dat);
          Iterator<Edge> es = n.getConnections();
          while (es.hasNext()) {
            Edge e = es.next();
            Node dest = e.getStart() == n ? e.getEnd() : e.getStart();
            double dstMod = 2 * M * (1.0 - edges * dest.getEdgesList().size() * M);
            dat.addQ(dest.getId(), dstMod);
          }
          Q += -1.0 * (edges * M) * (edges * M);
          if (n.getId() < dat.getMxQNId()) {
            addToHeap(createEdge(dat.getMxQ(), n.getId(), dat.getMxQNId()));
          }
        }
      }

      void addToHeap(DoubleIntInt o) {
        heap.add(o);
      }

      DoubleIntInt createEdge(double val1, int val2, int val3) {
        // ensure only one DoubleIntInt object exists per (val2, val3) pair
        DoubleIntInt n = new DoubleIntInt(val1, val2, val3);
        if (set.containsKey(n)) {
          DoubleIntInt n1 = set.get(n);
          heap.remove(n1);
          if (n1.val1 < val1) {
            n1.val1 = val1;
          }
          n = n1;
        }
        else {
          set.put(n, n);
        }
        return n;
      }

      void removeFromHeap(Collection<DoubleIntInt> col, DoubleIntInt o) {
        col.remove(o);
      }

      DoubleIntInt findMxQEdge() {
        while (true) {
          if (heap.isEmpty()) {
            break;
          }
          DoubleIntInt topQ = heap.first();
          removeFromHeap(heap, topQ);
          if (!communityData.containsKey(topQ.val2) || !communityData.containsKey(topQ.val3)) {
            continue;  // one of the communities no longer exists
          }
          if (topQ.val1 != communityData.get(topQ.val2).getMxQ() && topQ.val1 != communityData.get(topQ.val3).getMxQ()) {
            continue;  // stale entry
          }
          return topQ;
        }
        return new DoubleIntInt(-1.0, -1, -1);
      }

      boolean mergeBestQ(Graph graph) {
        DoubleIntInt topQ = findMxQEdge();
        if (topQ.val1 <= 0.0) {
          return false;
        }
        // join the two communities
        int i = topQ.val3;
        int j = topQ.val2;
        uf.union(i, j);
        Q += topQ.val1;
        CommunityData datJ = communityData.get(j);
        CommunityData datI = communityData.get(i);
        datI.delLink(j);
        datJ.delLink(i);

        int[] datJData = datJ.nodeToQ.keys();
        for (int _k = 0; _k < datJData.length; _k++) {
          int k = datJData[_k];
          CommunityData datK = communityData.get(k);
          double newQ = datJ.nodeToQ.get(k);
          if (datI.nodeToQ.containsKey(k)) {  // k connected to both i and j
            newQ = newQ + datI.nodeToQ.get(k);
            datK.delLink(i);
          }
          else {                              // k connected to j but not i
            newQ = newQ - 2 * datI.DegFrac * datK.DegFrac;
          }
          datJ.addQ(k, newQ);
          datK.addQ(j, newQ);
          addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
        }

        int[] datIData = datI.nodeToQ.keys();
        for (int _k = 0; _k < datIData.length; _k++) {
          int k = datIData[_k];
          if (!datJ.nodeToQ.containsKey(k)) {  // k connected to i but not j
            CommunityData datK = communityData.get(k);
            double newQ = datI.nodeToQ.get(k) - 2 * datJ.DegFrac * datK.DegFrac;
            datJ.addQ(k, newQ);
            datK.delLink(i);
            datK.addQ(j, newQ);
            addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
          }
        }

        datJ.DegFrac += datI.DegFrac;
        if (datJ.nodeToQ.isEmpty()) {
          communityData.remove(j);  // isolated community (done)
        }
        communityData.remove(i);
        return true;
      }
    }

UPDATE: the current code is pretty fast, and uses about half the memory of the "fastest" solution while being only ~5% slower. The difference lies in using a HashMap + TreeSet instead of a priority queue, and in ensuring that only one object exists for a given pair i, j at any time.

1 answer

So, here is the original paper: a tidy six pages, only two of which cover the design and implementation. Here are the CliffsNotes:

  • To partition a given graph, the authors define the modularity, Q, of the partition as the fraction of the graph's edges that fall inside communities, minus the fraction you would expect from a completely random partition.
  • So it is effectively asking, "how much better is this partition at defining communities than a completely random one?"
  • Given two communities i and j of the partition, they then define deltaQ_ij as how much the partition's modularity would change if communities i and j were merged. So if deltaQ_ij > 0, merging i and j improves the modularity of the partition.
  • This leads to a simple greedy algorithm: start with every node in its own community. Compute deltaQ_ij for every pair of communities. Whichever two communities i, j have the largest deltaQ_ij, merge them. Repeat. (A pseudocode sketch follows this list.)
  • You reach maximum modularity once all the deltaQ_ij become negative, but in the paper the authors let the algorithm run until only one community remains.
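Translated into code, a deliberately naive sketch of that greedy loop might look like the following. The helpers initSingletonCommunities and deltaQ are hypothetical placeholders (they are not from the paper or from SNAP), and the O(n^2) scan over all pairs at every step is exactly the cost the paper's data structures are designed to avoid:

    // Naive O(n^2)-per-merge sketch of the CNM greedy loop (illustrative only).
    static List<Set<Integer>> greedyCNM(Graph graph) {
      List<Set<Integer>> communities = initSingletonCommunities(graph); // hypothetical helper
      while (communities.size() > 1) {
        double bestDelta = Double.NEGATIVE_INFINITY;
        int bestI = -1, bestJ = -1;
        // find the pair of communities whose merge increases modularity the most
        for (int i = 0; i < communities.size(); i++) {
          for (int j = i + 1; j < communities.size(); j++) {
            double d = deltaQ(graph, communities.get(i), communities.get(j)); // hypothetical helper
            if (d > bestDelta) { bestDelta = d; bestI = i; bestJ = j; }
          }
        }
        // stop once no merge improves modularity (the paper instead runs on to a
        // single community and remembers the best Q seen along the way)
        if (bestDelta <= 0) break;
        communities.get(bestI).addAll(communities.get(bestJ));
        communities.remove(bestJ);  // remove by index
      }
      return communities;
    }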

That is most of what you need to understand the algorithm. The details are in how to compute deltaQ_ij quickly and how to store the required information efficiently.
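For reference, the quantities involved boil down to a handful of formulas. I am quoting these from memory, so check them against the paper's equations before relying on them. With m edges, node degrees k, and a_i = k_i / 2m:

    \Delta Q_{ij} =
      \begin{cases}
        \dfrac{1}{2m} - \dfrac{k_i k_j}{(2m)^2}, & i \text{ and } j \text{ connected},\\[4pt]
        0, & \text{otherwise},
      \end{cases}
    \qquad a_i = \frac{k_i}{2m}.

    \text{After merging } i \text{ and } j \text{ (keeping label } j\text{):}\quad
    \Delta Q'_{jk} =
      \begin{cases}
        \Delta Q_{ik} + \Delta Q_{jk}, & k \text{ connected to both } i \text{ and } j,\\
        \Delta Q_{ik} - 2 a_j a_k, & k \text{ connected to } i \text{ only},\\
        \Delta Q_{jk} - 2 a_i a_k, & k \text{ connected to } j \text{ only},
      \end{cases}
    \qquad a_j \leftarrow a_i + a_j,\; a_i \leftarrow 0.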

Edit: data structure time!

So, first of all, I think the implementation you are referring to does things differently from the paper. I am not quite sure how, because the code is impenetrable, but it appears to use union-find and hash sets instead of the authors' binary trees and multiple heaps. I have no idea why they do it differently. You might want to email the person who wrote it and ask.

In any case, the algorithm in the paper needs several things from whatever data structure the deltaQ values are stored in:

  • First, it has to be able to quickly retrieve the largest value in deltaQ overall.
  • Second, it has to be able to quickly remove all deltaQ_ik and deltaQ_ki for a fixed i.
  • Third, it has to be able to quickly update all deltaQ_kj and deltaQ_jk for a fixed j.

The solution the authors come up with is this (a simplified Java sketch follows the list):

  • For each community i, every non-zero deltaQ_ik is stored both in a balanced binary tree, indexed by k (so individual elements can be found easily), and in a heap (so the maximum for that community can be found easily).
  • The maximum deltaQ_ik from each community i's heap is then stored in another heap, so the overall maximum can be found easily.
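If it helps to see that in plain Java, here is a simplified sketch of a structure that supports those operations. It is not the authors' exact layout: it keeps a TreeMap per community in place of the balanced binary tree, and replaces the per-community heaps plus the heap of heaps with a single sorted set of (deltaQ, i, k) triples that relies on lazy deletion, so stale entries are simply skipped when they surface. All names here are made up for illustration:

    import java.util.*;

    // Sketch: per-community storage of non-zero deltaQ values plus a global
    // "best pair" index (lazy deletion instead of the paper's heap of heaps).
    class DeltaQIndex {
      // community i -> (community k -> deltaQ_ik); plays the role of the
      // balanced binary tree: lookup and removal by k.
      private final Map<Integer, TreeMap<Integer, Double>> dq = new HashMap<>();

      // (deltaQ, i, k) triples, largest deltaQ first; stale entries tolerated.
      private final TreeSet<double[]> best = new TreeSet<>((a, b) ->
          a[0] != b[0] ? Double.compare(b[0], a[0])
                       : a[1] != b[1] ? Double.compare(a[1], b[1])
                                      : Double.compare(a[2], b[2]));

      void put(int i, int k, double q) {
        dq.computeIfAbsent(i, x -> new TreeMap<>()).put(k, q);
        best.add(new double[]{q, i, k});
      }

      // Pop candidates until one still matches the value stored in dq.
      double[] popBest() {
        while (!best.isEmpty()) {
          double[] top = best.pollFirst();
          TreeMap<Integer, Double> row = dq.get((int) top[1]);
          Double current = row == null ? null : row.get((int) top[2]);
          if (current != null && current == top[0]) {
            return top;  // {deltaQ, i, k}
          }
        }
        return null;
      }

      // Drop all deltaQ_ik and deltaQ_ki once community i has been merged away.
      void removeCommunity(int i) {
        TreeMap<Integer, Double> row = dq.remove(i);
        if (row == null) return;
        for (int k : row.keySet()) {
          TreeMap<Integer, Double> other = dq.get(k);
          if (other != null) other.remove(i);
        }
      }
    }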

When community i is merged with community j, several things happen to the binary trees (sketched in code after this list):

  • First, each element from the i-th community's tree is added to the j-th community's tree. If an element with the same index k already exists, the old and new values are summed.
  • Second, all the remaining "old" values in the j-th community's tree are updated to reflect the fact that the j-th community has just grown in size.
  • Third, for every other community k's binary tree, any deltaQ_kj is updated.
  • Finally, the tree for community i is thrown away.
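A rough Java rendering of those four steps, using plain TreeMaps in place of the trees and the update formulas quoted earlier. Here dq maps each community to its TreeMap of deltaQ values (as in the sketch above) and a holds each community's a value; both are hypothetical structures, and the heap bookkeeping is omitted:

    // Merge community i into community j: fold i's tree into j's, adjust j's
    // old entries, mirror the new values into the other communities' trees,
    // then discard i. Pre-merge a values are used for the adjustments.
    void mergeTrees(int i, int j,
                    Map<Integer, TreeMap<Integer, Double>> dq,
                    Map<Integer, Double> a) {
      TreeMap<Integer, Double> rowI = dq.get(i);
      TreeMap<Integer, Double> rowJ = dq.get(j);
      rowI.remove(j);
      rowJ.remove(i);

      // 1) add each element of i's tree to j's tree, summing on collisions
      for (Map.Entry<Integer, Double> e : rowI.entrySet()) {
        int k = e.getKey();
        if (rowJ.containsKey(k)) {
          rowJ.put(k, rowJ.get(k) + e.getValue());              // k touches both i and j
        } else {
          rowJ.put(k, e.getValue() - 2 * a.get(j) * a.get(k));  // k touches i only
        }
      }
      // 2) update the remaining "old" entries of j (k touches j only)
      for (Map.Entry<Integer, Double> e : rowJ.entrySet()) {
        if (!rowI.containsKey(e.getKey())) {
          e.setValue(e.getValue() - 2 * a.get(i) * a.get(e.getKey()));
        }
      }
      // 3) mirror the new deltaQ_kj into every other community k's tree
      for (Map.Entry<Integer, Double> e : rowJ.entrySet()) {
        dq.get(e.getKey()).put(j, e.getValue());
      }
      // 4) discard i's tree and grow a_j
      dq.remove(i);
      a.put(j, a.get(i) + a.get(j));
      a.put(i, 0.0);
    }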

And similarly, several things have to happen to the heaps (a small sketch follows the list):

  • First, the heap for community i is thrown away.
  • Then the heap for community j is rebuilt from scratch from the elements of that community's balanced binary tree.
  • And in each other community k's heap, the position of the deltaQ_kj entry is updated.
  • Finally, the entry for community i is removed from the top-level heap (letting things bubble up as needed), and the entries for community j and for each community k connected to i or j are updated.
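And a tiny sketch of the per-community heap rebuild, assuming java.util.PriorityQueue. Unlike the paper's hand-rolled heaps, PriorityQueue cannot re-heapify a single entry in place, so this sketch simply rebuilds the whole thing from the community's tree:

    // Rebuild community j's max-heap of (neighbour, deltaQ) pairs from its tree.
    PriorityQueue<Map.Entry<Integer, Double>> rebuildHeap(TreeMap<Integer, Double> rowJ) {
      PriorityQueue<Map.Entry<Integer, Double>> h =
          new PriorityQueue<>((x, y) -> Double.compare(y.getValue(), x.getValue()));
      for (Map.Entry<Integer, Double> e : rowJ.entrySet()) {
        h.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
      }
      return h;
    }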

Oddly enough, when two communities are merged, the paper makes no mention of removing the deltaQ_ki values from community k's heap or tree. I think this may be handled by setting a_i = 0, but I do not understand the algorithm well enough to be sure.

Edit: an attempt at deciphering the implementation you linked. Its primary data structures are:

  • CmtyIdUF, a union-find structure that keeps track of which community each node belongs to (something not covered in the paper, but seemingly necessary unless you want to reconstruct community membership from a trace of the merges or something like that).
  • MxQHeap, the heap for tracking which deltaQ_ij is largest overall. Oddly enough, when they update the value of a TFltIntIntTr in the heap, they never ask the heap to re-heapify itself. That is worrying. Does it do so automatically or something?
  • CmtyQH, a hash map that maps a community id i to a TCmtyDat structure, which holds the heap of deltaQ_ik for that community. I think. Oddly, though, TCmtyDat's UpdateMaxQ takes linear time, which obviates the need for a heap at all. What is more, UpdateMaxQ appears to be called only when an item is deleted from the heap; it should also be called whenever the value of any item in the heap is updated. (A sketch of that fix, applied to the CommunityData class from the question, follows below.)
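That last point applies to the Java code in the question as well, as far as I can tell: CommunityData.addQ only ever moves maxQId towards a larger value, so if the current maximum's deltaQ is lowered during a merge, maxQId goes stale. A possible sketch of a safer addQ, using the field and method names from the question's code (untested):

    // Keep maxQId valid even when an existing neighbour's deltaQ is lowered
    // below the current maximum (the case pointed out above).
    void addQ(int NId, double Q) {
      boolean had = nodeToQ.containsKey(NId);
      double old = had ? nodeToQ.get(NId) : Double.NEGATIVE_INFINITY;
      nodeToQ.put(NId, Q);
      if (maxQId == -1 || nodeToQ.get(maxQId) < Q) {
        maxQId = NId;        // new value becomes the maximum
      } else if (NId == maxQId && Q < old) {
        updateMaxQ();        // the current maximum was decreased: rescan
      }
    }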
