Huge graph structure

I am developing an application in which I need a structure to represent a huge graph (between 1,000,000 and 6,000,000 nodes, with 100 to 600 edges per node) in memory. The edges of the graph will carry some relationship attributes.

I tried memory-mapped representations, arrays, dictionaries, and strings to represent this structure in memory, but every attempt crashes against memory limits.

I would like some advice on how to represent this, or something along those lines.

By the way, I am using Python.

8 answers
  • If it's 100-600 edges per node, you're talking about up to 3.6 billion edges.
  • Why does all of this need to be in memory?
  • Can you show us the structures you are currently using?
  • How much memory do you have, and what memory limit are you hitting?

If all you need is fast reads and writes, then use a database. Databases read and write very quickly, and can often serve reads from cache without going to disk.
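A minimal sketch of the database approach using Python's built-in sqlite3 module; the table and column names here are illustrative, not from the question:

```python
import sqlite3

# In-memory DB for the sketch; pass a file path for a graph this large.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER, weight REAL)")
# An index on src makes "all edges out of node N" a fast indexed lookup.
conn.execute("CREATE INDEX idx_src ON edges (src)")

conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [(1, 2, 0.5), (1, 3, 1.0), (2, 3, 0.25)],
)

# Neighbours of node 1, read straight from the index.
neighbours = [dst for (dst,) in
              conn.execute("SELECT dst FROM edges WHERE src = ?", (1,))]
print(sorted(neighbours))  # [2, 3]
```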


Depending on your hardware resources, holding everything in memory for a graph of this size is probably out of the question. Two possible options from a graph-database point of view are:

  • Neo4j - claims to handle billions of nodes easily and has been in development for a long time.
  • FlockDB - recently released by Twitter, this is a distributed graph database.

Since you are using Python, have you looked at NetworkX? How far did you get loading a graph of this size, if you don't mind my asking?


I doubt you will be able to use an in-memory structure unless you have a LOT of memory at your disposal:

Suppose you are talking about 600 directed edges from each node, with each node identified by a 4-byte integer key and each directed edge storing JUST the destination id (4 bytes each).

Then the raw data for each node is 4 + 600 * 4 = 2404 bytes, times 6,000,000 nodes = more than 14.4 GB.

This is without any other overhead or any additional data in the nodes (or edges).
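The arithmetic above can be checked directly (the 4-byte sizes are this answer's assumptions, not a property of any particular library):

```python
nodes = 6_000_000
edges_per_node = 600
bytes_per_id = 4  # assumed 4-byte integer ids

# Each node record: its own id plus one id per outgoing edge.
per_node = bytes_per_id + edges_per_node * bytes_per_id
total = per_node * nodes

print(per_node)                 # 2404 bytes per node
print(round(total / 1e9, 1))    # 14.4 (GB, decimal)
```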


It sounds like you have very few edges given the number of nodes, which suggests that most nodes are not strictly necessary. So instead of actually storing all the nodes, why not use a sparse structure and insert nodes only when they are in use? This should be pretty easy to do with a dictionary; just don't insert a node until an edge uses it.

Edges can be stored as an adjacency list on the nodes.

Of course, this only applies if you really mean 100-600 edges in total. If you mean per node, that is a completely different story.
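A lazily built adjacency dictionary along these lines might look like the following sketch; the class and method names are made up for illustration:

```python
class SparseGraph:
    """Only nodes that actually appear in an edge are ever stored."""

    def __init__(self):
        self.adj = {}  # node id -> {neighbour id: edge attributes}

    def add_edge(self, src, dst, **attrs):
        # setdefault inserts a node the first time an edge touches it.
        self.adj.setdefault(src, {})[dst] = attrs
        self.adj.setdefault(dst, {})  # make the destination known too

    def neighbours(self, node):
        return self.adj.get(node, {})


g = SparseGraph()
g.add_edge(1, 2, weight=0.5)
g.add_edge(1, 3, weight=2.0)
print(len(g.adj))               # 3 nodes materialised, out of millions possible
print(sorted(g.neighbours(1)))  # [2, 3]
```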


The scipy.sparse.csgraph package can handle this: 5 million nodes * 100 edges on average is 500 million pairs, and at 8 bytes per pair (two integer ids) that is about 4 GB. I think csgraph uses compression, so it will use less memory than that; it may even work on your laptop.

csgraph does not have as many features as networkx, but it uses waaay less memory.
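The compression csgraph relies on is the CSR (compressed sparse row) layout: one flat array of neighbour ids plus an offsets array, instead of one Python object per edge. A hand-rolled sketch of that layout, using only the standard library to show the idea:

```python
from array import array

# Toy adjacency lists for nodes 0..2.
adjacency = {0: [1, 2], 1: [2], 2: []}

indptr = array("l", [0])   # indptr[i]:indptr[i+1] slices node i's neighbours
indices = array("l")       # all neighbour ids, concatenated
for node in range(3):
    indices.extend(adjacency[node])
    indptr.append(len(indices))

def neighbours(node):
    return list(indices[indptr[node]:indptr[node + 1]])

print(neighbours(0))  # [1, 2]
print(neighbours(2))  # []
```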


Assuming you do mean 600 edges per node, you could try something like this:

 import os.path
 import cPickle

 class LazyGraph:
     def __init__(self, folder):
         self.folder = folder

     def get_node(self, id):
         f = open(os.path.join(self.folder, str(id)), 'rb')
         node = cPickle.load(f)
         f.close()  # just being paranoid
         return node

     def set_node(self, id, node):
         f = open(os.path.join(self.folder, str(id)), 'wb')
         cPickle.dump(node, f, -1)  # use highest protocol
         f.close()  # just being paranoid

Use arrays (or numpy arrays) to hold the actual node ids, as they are faster.

Please note: this will be very slow.

You could use threads to pre-fetch nodes (provided you know the order in which you will process them), but it will not be fun.
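One way that pre-fetching could work is a background thread filling a bounded queue while the main thread consumes it. In this sketch, the `load` callable stands in for something like the `get_node` disk read above, and `prefetch` is a hypothetical helper:

```python
import queue
import threading

def prefetch(load, ids, depth=8):
    """Yield load(id) for each id, fetching ahead in a background thread."""
    q = queue.Queue(maxsize=depth)  # bounded, so we never fetch too far ahead
    done = object()                 # sentinel marking the end of the stream

    def worker():
        for node_id in ids:
            q.put(load(node_id))
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not done:
        yield item

# Toy stand-in for a slow disk read.
nodes = list(prefetch(lambda i: i * 10, [1, 2, 3]))
print(nodes)  # [10, 20, 30]
```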


It sounds like you need a database and an iterator over the results. Then you would not have to keep it all in memory at once, but you could always access it.
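The iterator part could be a generator that pulls rows from a cursor in batches, so only one batch is ever resident; this sketch uses sqlite3 and an invented `edges` table:

```python
import sqlite3

def iter_edges(conn, batch=1000):
    """Stream edges from the database without loading them all at once."""
    cur = conn.execute("SELECT src, dst FROM edges")
    while rows := cur.fetchmany(batch):
        yield from rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 3)])

total = sum(1 for _ in iter_edges(conn))
print(total)  # 3
```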


If you do decide to use some kind of database, I suggest looking at Neo4j and its Python bindings. It is a graph database capable of handling large graphs. Here's a presentation on it from this year's PyCon.
