Read / write network object GraphX

I'm trying to figure out a super-massive NetworkX Graph object with hundreds of millions of nodes. I would like to write it to a file so as not to consume all the computer memory. However, I need to constantly search for existing nodes, update edges, etc.

Is there a good solution for this? I'm not sure how it will work with any file format presented at http://networkx.lanl.gov/reference/readwrite.html

The only solution I can think of is to store each node as a separate file with links to other nodes in the file system - thus opening one node for verification does not overload the memory. Is there an existing file system for large amounts of data (for example, PyTables), so as not to write your own template code?

+8
python file-io networkx
source share
2 answers

If you built it as a NetworkX graph, it will already be in memory. I believe that for this large number of graphs you need to do something similar to what you suggested with separate files. But instead of using separate files, I would use a database to store each node with many-to-many connections between nodes. In other words, you will have a table of nodes and a table of edges, then to query for neighbors a specific node you can simply query for any edges that have this particular node at both ends. This should be quick, although I'm not sure that you can use NetworkX analysis functions without first building the entire network in memory.

+2
source share

First try pickle ; It is designed to serialize arbitrary objects.

Example of creating DiGraph and serializing to a file:

 import pickle import networkx as nx dg = nx.DiGraph() dg.add_edge('a','b') dg.add_edge('a','c') pickle.dump(dg, open('/tmp/graph.txt', 'w')) 

Example loading a DiGraph from a file:

 import pickle import networkx as nx dg = pickle.load(open('/tmp/graph.txt')) print dg.edges() 

Output:

 [('a', 'c'), ('a', 'b')] 

If this is not efficient enough, I would write my own procedure for serialization:

  • ribs and
  • (in case a node has no boundaries).

Note that using lists, when possible, can be much more efficient (instead of the standard for loops).

If this is not efficient enough, I would call the C ++ routine from Python: http://docs.python.org/extending/extending.html

+18
source share

All Articles