Algorithm for finding the connected components of a huge graph

I have an undirected graph with 4,000,000,000 (four billion) edges, given in a large text file as pairs of node identifiers. I would like to compute the connected components of this graph. Unfortunately, loading the node IDs and edges into memory takes more than the 128 GB of RAM that I have.

Is there an external-memory algorithm for finding connected components that is relatively easy to implement? Or, better yet, can this be done with Unix command-line tools and existing (Python) libraries?

+6
2 answers

Based on the description of the problem and the answers you gave in the comments, I think the easiest way to do this may be an approach like the one @dreamzor described. Here is a more fleshed-out version of that answer.

The basic idea is to convert the data into a more compressed format that fits in memory, run a regular connected-components algorithm on that data, and then decompress it. Note that if you assign each node a 32-bit numeric identifier, the total space needed is at most four billion node IDs plus eight billion edge endpoints (assuming you store each edge twice), which is twelve billion 32-bit integers in all - only about 48 GB of space, below your memory threshold.
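A back-of-the-envelope check of that estimate (the node and edge counts are the figures assumed in this answer, not measured values):

```python
# Memory estimate: 4 billion node IDs plus two copies of 4 billion edges,
# all stored as 32-bit (4-byte) integers.
nodes = 4_000_000_000            # one 32-bit ID per node (upper bound)
edges = 4_000_000_000            # undirected edges, each stored twice
ints = nodes + 2 * edges         # 12 billion 32-bit integers
bytes_needed = ints * 4          # 48 billion bytes, i.e. about 48 GB
print(bytes_needed)
```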

To get started, write a script that reads the edge file and assigns a numeric identifier to each node (say, sequentially in the order in which the nodes first appear). Have the script write this mapping to a file and, as it goes, write out a new edge file that uses numeric node IDs instead of the string names. When it is done, you will have a file mapping names to IDs and an edge file that takes up much less space than before. You mentioned in the comments that you can fit all the node names in memory, so this step should be very reasonable. Note that you do not need to hold all the edges in memory - you can stream them through the program - so this should not be a bottleneck.
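A minimal sketch of that renumbering pass, assuming whitespace-separated "name1 name2" edge lines (the file names here are illustrative, not prescribed by the answer):

```python
# Stream the edge file, assign sequential numeric IDs to node names,
# and write both a renumbered edge file and a name-to-ID mapping file.
def renumber(edges_in="edges.txt", edges_out="edges_ids.txt",
             mapping_out="node_ids.txt"):
    ids = {}                          # node name -> sequential numeric ID
    with open(edges_in) as src, open(edges_out, "w") as dst:
        for line in src:              # stream edges; never hold them all
            a, b = line.split()
            ia = ids.setdefault(a, len(ids))
            ib = ids.setdefault(b, len(ids))
            dst.write(f"{ia} {ib}\n")
    with open(mapping_out, "w") as m:
        for name, i in ids.items():
            m.write(f"{i} {name}\n")
    return ids
```

Only the name-to-ID dictionary lives in memory; the edges themselves are read and written one line at a time.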

Next, write a program that reads the edge file - but not the names file - into memory and finds the connected components using any reasonable algorithm (BFS or DFS would work well here). If you are careful with your memory usage (using something like C or C++ here would be a good call), this should fit comfortably in main memory. When it finishes, have it write all the components to an output file in terms of the numeric identifiers. You now have a list of all the CCs by ID.
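The answer suggests BFS or DFS over the in-memory edges; as an alternative sketch, a union-find (disjoint-set) structure lets you stream the edge file and keep only a single parent array in memory. Python is used here for brevity - at the stated scale you would want C/C++ as the answer recommends. The file names and the "a b" integer-pair line format are assumptions:

```python
# Connected components via union-find: stream the renumbered edge file,
# keeping only one parent entry per node in memory.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def components(num_nodes, edges_in="edges_ids.txt", out="components.txt"):
    parent = list(range(num_nodes))
    with open(edges_in) as src:
        for line in src:                # edges are streamed, not stored
            a, b = map(int, line.split())
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[ra] = rb         # union the two components
    with open(out, "w") as dst:
        for v in range(num_nodes):      # one "node root" line per node
            dst.write(f"{v} {find(parent, v)}\n")
```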

Finally, write a program that reads the ID-to-node-name mapping from the names file, then streams through the component file and writes the names of the nodes in each component to the destination file.
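A sketch of that final join, assuming "id name" lines in the mapping file and "node root" lines in the component file (illustrative formats, not a spec):

```python
# Translate component IDs back to node names: load the ID -> name map,
# then stream the component file and group names by component root.
from collections import defaultdict

def label_components(mapping_in="node_ids.txt",
                     components_in="components.txt",
                     out="clusters.txt"):
    names = {}
    with open(mapping_in) as m:
        for line in m:
            i, name = line.split()
            names[int(i)] = name
    clusters = defaultdict(list)      # component root -> member names
    with open(components_in) as c:
        for line in c:
            v, root = map(int, line.split())
            clusters[root].append(names[v])
    with open(out, "w") as dst:
        for members in clusters.values():
            dst.write(" ".join(members) + "\n")
```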

This approach should be relatively easy to implement, because the key idea is to keep the existing algorithms you are used to and only change the representation of the graph to be more memory-efficient. I have used approaches like this before when dealing with huge graphs (Wikipedia), and it worked beautifully even on systems with less memory than yours.

+4

You can keep in memory only an array holding each vertex's "color" (an int value), and then stream through the file without loading the whole edge set, coloring vertices as you go: assign a new color if neither endpoint is colored; copy the existing color if exactly one endpoint is colored; and if both are colored with different colors, keep the lower of the two colors and repaint every vertex in the array that carries the higher color. Pseudo-code example:

int nextColor = 1;
int merges = 0;
int[] vertices;  // vertices[v] == 0 means "not colored yet"
while (!file.eof()) {
    link = file.readLink();
    c1 = vertices[link.a];
    c2 = vertices[link.b];
    if ((c1 == 0) && (c2 == 0)) {            // neither endpoint colored
        vertices[link.a] = nextColor;
        vertices[link.b] = nextColor;
        nextColor++;
    } else if ((c1 != 0) && (c2 != 0)) {     // both colored: merge
        if (c1 != c2) {
            lo = min(c1, c2); hi = max(c1, c2);
            for (i = vertices.length - 1; i >= 0; i--)
                if (vertices[i] == hi) vertices[i] = lo;
            merges++;
        }
    } else if (c1 == 0) {
        vertices[link.a] = c2;               // only c1 is 0
    } else {
        vertices[link.b] = c1;               // only c2 is 0
    }
}
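A runnable Python rendering of the same coloring pass, for illustration. Each merge scans the whole vertex array, so this is simple rather than fast; the "a b" integer edge-line format is an assumption:

```python
# Single-pass coloring: stream edge lines, assigning and merging colors.
# A merge repaints the higher color to the lower one across the array.
def color_pass(num_vertices, edge_lines):
    vertices = [0] * num_vertices          # 0 means "not colored yet"
    next_color = 1
    merges = 0
    for line in edge_lines:
        a, b = map(int, line.split())
        c1, c2 = vertices[a], vertices[b]
        if c1 == 0 and c2 == 0:            # brand-new component
            vertices[a] = vertices[b] = next_color
            next_color += 1
        elif c1 != 0 and c2 != 0:
            if c1 != c2:                   # two components meet: merge
                lo, hi = min(c1, c2), max(c1, c2)
                for i, c in enumerate(vertices):
                    if c == hi:
                        vertices[i] = lo   # repaint the higher color
                merges += 1
        elif c1 == 0:
            vertices[a] = c2               # only a is uncolored
        else:
            vertices[b] = c1               # only b is uncolored
    return vertices, merges
```

After the pass, vertices sharing a color are in the same connected component, and any vertex still colored 0 is isolated.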

If you choose a 32-bit type to store the vertex color, you may need to check whether nextColor has run out, keep a list of unused colors (freed up during merges) to reuse, and skip coloring a new pair of vertices when no color is available - then re-run the file-reading pass if all the colors were in use and any merges occurred.

UPDATE: since the vertices are not actually ints but strings, you should also keep a string-to-int map while parsing the file. If your strings are bounded in length, you can fit them all in memory as a hash table, but I would pre-process the file by creating another file in which every name "s1" is replaced by "1", "s2" by "2", and so on, where "s1", "s2" are whatever names appear as vertices in the file, so the data is compressed into a list of int pairs. In case you will process such data again later (i.e. your graph does not change much and contains mostly the same vertex names), save a "metadata" file with the name-to-int mapping to make further preprocessing easier.

+1
