Based on your description of the problem and the answers you gave in the comments, I think the easiest way to do this would be an approach like the one described by @dreamzor. Here's a more fleshed-out version of that answer.
The main idea is to convert the data into a more compressed format that fits into memory, run a regular connected-components algorithm on that data, and then convert the results back. Note that if you assign each node a 32-bit numeric ID, the total space required to store everything is at most four billion nodes and eight billion edges (assuming you store two copies of each edge), which is space for twelve billion 32-bit integers, or only about 48 GB, below your memory threshold.
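To spell out the arithmetic (assuming an adjacency-list layout, where each stored copy of an edge contributes one 32-bit neighbor entry):

```
 4 billion nodes                    ->  4 billion 32-bit ints
 4 billion edges x 2 copies         ->  8 billion 32-bit ints
 total: 12 billion ints x 4 bytes   =  48 GB
```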
To get started, write a script that reads in the edges file and assigns a numeric ID to each node (say, sequentially in the order in which they first appear). Have this script write this mapping to a file and, as it goes, write out a new edges file that uses the numeric node IDs rather than the string names. When it's done, you'll have a file mapping names to IDs and an edges file that takes up much less space than before. You mentioned in the comments that you can fit all the node names in memory, so this step should be very reasonable. Note that you don't need to store the edges in memory - you can stream them through the program - so this shouldn't be a bottleneck.
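Here's a minimal sketch of that step in C++. It assumes the edges file has one whitespace-separated pair of node names per line; the file names (`edges.txt`, `edges.ids`, `names.txt`) are placeholders:

```cpp
// Assign sequential 32-bit IDs to node names and rewrite the edge list.
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

int main() {
    std::ifstream in("edges.txt");         // "nameA nameB" per line (assumed format)
    std::ofstream edgesOut("edges.ids");   // numeric edge list
    std::ofstream namesOut("names.txt");   // line i holds the name of node i

    std::unordered_map<std::string, std::uint32_t> idOf;
    std::string a, b;
    while (in >> a >> b) {
        for (const std::string* s : {&a, &b}) {
            // First time we see a name, give it the next free ID and record it.
            if (idOf.emplace(*s, static_cast<std::uint32_t>(idOf.size())).second)
                namesOut << *s << '\n';
        }
        edgesOut << idOf[a] << ' ' << idOf[b] << '\n';
    }
}
```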
Next, write a program that reads the edges file - but not the names file - into memory and finds the connected components using any reasonable algorithm (BFS or DFS would be great here). If you're careful with your memory (using something like C or C++ here would be a good call), this should fit comfortably into main memory. When it's done, have it write all the clusters to an external file by numeric ID. You now have a list of all the CCs by ID.
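A sketch of that program using BFS, assuming the dense 0..n-1 IDs produced by the previous step (again, file names are placeholders):

```cpp
// Label connected components with BFS over the numeric edge list.
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <queue>
#include <vector>

int main() {
    std::ifstream in("edges.ids");
    std::vector<std::vector<std::uint32_t>> adj;
    std::uint32_t u, v;
    while (in >> u >> v) {
        std::uint32_t hi = std::max(u, v);
        if (hi >= adj.size()) adj.resize(hi + 1);
        adj[u].push_back(v);               // store each edge in both directions
        adj[v].push_back(u);
    }

    const std::uint32_t UNSEEN = 0xFFFFFFFF;
    std::vector<std::uint32_t> comp(adj.size(), UNSEEN);
    std::uint32_t nextComp = 0;

    std::ofstream out("clusters.ids");     // "nodeID componentID" per line
    for (std::uint32_t s = 0; s < adj.size(); ++s) {
        if (comp[s] != UNSEEN) continue;
        std::queue<std::uint32_t> q;       // BFS from each not-yet-visited node
        q.push(s);
        comp[s] = nextComp;
        while (!q.empty()) {
            std::uint32_t x = q.front(); q.pop();
            out << x << ' ' << comp[x] << '\n';
            for (std::uint32_t y : adj[x])
                if (comp[y] == UNSEEN) { comp[y] = nextComp; q.push(y); }
        }
        ++nextComp;
    }
}
```

At billions of nodes, a flat compressed adjacency (CSR) layout or a union-find structure would be more memory-frugal than the vector-of-vectors used here, but the logic is the same.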
Finally, write a program that reads the ID-to-node-name mapping from the names file, then streams over the cluster IDs and writes back the names of all the nodes in each cluster to the destination file.
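A sketch of that last step, assuming the file formats written by the earlier sketches:

```cpp
// Translate "nodeID componentID" lines back to node names.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

int main() {
    std::ifstream namesIn("names.txt");    // line i holds the name of node i
    std::vector<std::string> nameOf;
    for (std::string line; std::getline(namesIn, line); )
        nameOf.push_back(line);

    std::ifstream clustersIn("clusters.ids");
    std::ofstream out("clusters.txt");     // final "name componentID" lines
    std::uint32_t node, comp;
    while (clustersIn >> node >> comp)
        out << nameOf[node] << ' ' << comp << '\n';
}
```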
This approach should be relatively easy to implement, because the key idea is to keep the existing algorithms you're used to and just change the representation of the graph to make it more memory-efficient. I've used approaches like this before when working with huge graphs (Wikipedia), and it worked beautifully even on systems with less memory than yours.