Using Python to generate a connection / network graph

I have a text file with approximately 8.5 million data points in the form:

Company 87178481 Company 893489 Company 2345788 [...] 

I want to use Python to create a connection diagram so I can see what the network between companies looks like. In the example above, two companies share an edge if the value in the second column is the same (clarification added for Hooked).
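To make the edge rule concrete, here is a simplified sketch of the kind of thing I am doing, assuming one whitespace-separated "company id" pair per line (the file name and column layout here are placeholders for my real data):

    from collections import defaultdict
    from itertools import combinations
    import networkx as nx

    # Group company labels by the shared ID in the second column
    groups = defaultdict(set)
    with open("companies.txt") as fh:   # placeholder file name
        for line in fh:
            parts = line.split()
            if len(parts) >= 2:
                groups[parts[1]].add(parts[0])

    # Companies that share an ID are connected pairwise
    G = nx.Graph()
    for members in groups.values():
        G.add_edges_from(combinations(members, 2))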

I used the NetworkX package and was able to create a network with several thousand points, but it never got through the full text file with 8.5 million nodes. I started it and left it running for about 15 hours; when I came back, the cursor in the shell was still blinking, but there was no output graph.

Can I assume it is still working? Is there a better / faster / easier approach for a graph with millions of points?

+6
2 answers

If you have 1000K data points, you will need some way to look at the big picture. Depending on what exactly you are looking for, if you can assign a "distance" between companies (for example, the number of connections), you can visualize the relationships (or clustering) with a dendrogram.

Scipy does clustering:

http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html#module-scipy.cluster.hierarchy

and has a function to turn them into dendrograms for visualization:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram

For an example of such a distance, networkx provides a shortest-path function:

http://networkx.lanl.gov/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path

Ultimately, you will need to decide how you want to weight the distance between two companies (vertices) in your graph.
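A rough sketch of that pipeline (not code from this answer): shortest-path lengths from networkx serve as the pairwise distance, which scipy then clusters hierarchically and draws as a dendrogram. The toy graph and the fallback distance for disconnected pairs are assumptions for illustration:

    import networkx as nx
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    # Toy graph; replace with your company graph
    G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")])
    nodes = sorted(G.nodes())
    n = len(nodes)

    # Pairwise shortest-path lengths as the "distance" between companies
    lengths = dict(nx.all_pairs_shortest_path_length(G))
    dist = np.zeros((n, n))
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            # Disconnected pairs get a large fallback distance (assumption)
            dist[i, j] = lengths.get(u, {}).get(v, n + 1)

    # Hierarchical clustering on the condensed distance matrix, then plot
    Z = linkage(squareform(dist, checks=False), method="average")
    dendrogram(Z, labels=nodes)
    plt.show()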

+5

You have too much data, and if you visualize the whole network it will not make any sense. You need ways to: 1) reduce the number of companies by dropping those that are less important / less connected; 2) summarize the graph somehow, and then visualize it.

To reduce the size of the data, it is better to build the network yourself (using your own code to create the edge list of companies). That way you can shrink your graph as you go, for example by dropping singletons, of which there may be many; see the sketch below.
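A minimal sketch of building the edge list yourself, assuming the same whitespace-separated "company id" pairs as in the question; IDs held by only one company (singletons) are dropped before any graph library ever sees them. The file names are placeholders:

    from collections import defaultdict
    from itertools import combinations

    # Collect companies per shared ID without building a graph object yet
    groups = defaultdict(set)
    with open("companies.txt") as fh:   # placeholder input file
        for line in fh:
            parts = line.split()
            if len(parts) >= 2:
                groups[parts[1]].add(parts[0])

    # Write an edge list, skipping IDs held by a single company (singletons)
    with open("edges.txt", "w") as out:
        for members in groups.values():
            if len(members) < 2:
                continue
            for a, b in combinations(sorted(members), 2):
                out.write(f"{a} {b}\n")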

To summarize the graph, I recommend running a clustering or community detection algorithm. This can be done very quickly even for very large networks. Use the "fastgreedy" method in the igraph package: http://igraph.sourceforge.net/doc/R/fastgreedy.community.html (there is also a faster algorithm by Blondel et al.: http://perso.uclouvain.be/vincent.blondel/publications/08BG.pdf; I know their code is available online somewhere). A sketch is shown below.
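The link above points at igraph's R documentation, but python-igraph exposes the same algorithms. A hedged sketch, assuming python-igraph is installed and an edge list file like the one produced above (the file name is a placeholder); community_multilevel is igraph's implementation of the Blondel et al. (Louvain) method:

    import igraph as ig

    # Load the pre-built edge list (space-separated "companyA companyB" lines)
    g = ig.Graph.Read_Ncol("edges.txt", directed=False)
    g.simplify()  # fastgreedy needs a simple graph (no loops or multi-edges)

    # Greedy modularity optimisation ("fastgreedy")
    clusters = g.community_fastgreedy().as_clustering()
    print("fastgreedy communities:", len(clusters))

    # Louvain method (Blondel et al.), usually faster on very large graphs
    louvain = g.community_multilevel()
    print("multilevel communities:", len(louvain))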

+4
