Saving graphs in Spark GraphX with HDFS

Question

I built a graph in Spark GraphX. This graph could eventually have 1 billion nodes and more than 10 billion edges, so I do not want to rebuild it every time I need it.

I want to be able to create it once, save it (I think it's best in HDFS), start some processes on it, and then access it after a couple of days or weeks, add new nodes and edges, and start a few more processes on it.

How can I do this in Apache Spark GraphX?

EDIT: I think I found a potential solution, but I would like someone to confirm that this is the best way.

If I have a graph, say graph, I can store its vertex RDD and its edge RDD separately as text files and read those files back later. For example:

    graph.vertices.saveAsTextFile(vertexPath)
    graph.edges.saveAsTextFile(edgePath)

Now I have one question: should I use saveAsTextFile() or saveAsObjectFile()? And then, how should I access this file later?

apache-spark spark-graphx

edenmark Aug 4 '15 at 6:54

2 answers

Gaurav kumar · Answer 1 · 2015-11-13T14:04:06+0000

GraphX does not yet have a built-in mechanism for saving a graph. The next best thing is to save the edges and the vertices separately and rebuild the graph from them. If your vertices carry complex attributes, save them with saveAsObjectFile, which writes a SequenceFile of serialized objects.

    vertices.saveAsObjectFile("location/of/vertices")
    edges.saveAsObjectFile("location/of/edges")

Later, you can read them back from disk and construct the graph.

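    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // VD and ED stand for your own vertex and edge attribute types
    val vertices = sc.objectFile[(VertexId, VD)]("location/of/vertices")
    val edges = sc.objectFile[Edge[ED]]("location/of/edges")
    val graph = Graph(vertices, edges)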
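The save and reload steps, together with the later "add new nodes and edges" step the question asks about, might be wrapped up roughly as follows. This is only a sketch: `GraphStore`, `save`, `load`, and `extend` are hypothetical names, and the `String` vertex attributes and `Long` edge attributes are assumptions chosen to keep the types concrete.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; attribute types (String, Long) are assumptions.
object GraphStore {

  // Write vertices and edges to two separate directories as object files.
  def save(graph: Graph[String, Long], vertexPath: String, edgePath: String): Unit = {
    graph.vertices.saveAsObjectFile(vertexPath)
    graph.edges.saveAsObjectFile(edgePath)
  }

  // Read both directories back and rebuild the graph.
  def load(sc: SparkContext, vertexPath: String, edgePath: String): Graph[String, Long] = {
    val vertices = sc.objectFile[(VertexId, String)](vertexPath)
    val edges = sc.objectFile[Edge[Long]](edgePath)
    Graph(vertices, edges)
  }

  // Add new vertices and edges by taking the union of the old and new RDDs.
  // If a vertex ID appears more than once, Graph(...) keeps one attribute arbitrarily.
  def extend(
      graph: Graph[String, Long],
      newVertices: RDD[(VertexId, String)],
      newEdges: RDD[Edge[Long]]): Graph[String, Long] =
    Graph(graph.vertices.union(newVertices), graph.edges.union(newEdges))
}
```

Because the save calls fail when the target directory already exists, an extended graph would have to be written to a new pair of directories rather than over the old ones.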

Bradreees · Answer 2 · 2015-08-07T20:08:41+0000

As you mentioned, you need to save the edge data and possibly the vertex data. The question is whether you use custom vertex or edge classes. If there are no attributes on the edges or vertices, you can simply save the edges to a file and recreate the graph from it. A simple example using GraphLoader would be:

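    import org.apache.spark.graphx.GraphLoader

    // edgeListFile expects one whitespace-separated "srcId dstId" pair per line
    graph.edges.map(e => s"${e.srcId} ${e.dstId}").saveAsTextFile(path)
    ...
    val myGraph = GraphLoader.edgeListFile(sc, path)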

The only problem is that GraphLoader.edgeListFile returns a Graph[Int, Int], which can be a problem for large graphs. Once you hit the billions, you will do something like:

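    import org.apache.spark.graphx.{Edge, Graph}

    // Write explicit comma-separated lines so they are easy to parse later
    graph.edges.map(e => s"${e.srcId},${e.dstId}").saveAsTextFile(edgePath)
    graph.vertices.map { case (id, _) => id.toString }.saveAsTextFile(vertexPath)
    ....
    // Parse one "srcId,dstId" line into an edge with attribute 1L
    def convertToEdges(line: String): Edge[Long] = {
      val txt = line.split(",")
      Edge(txt(0).toLong, txt(1).toLong, 1L)
    }

    val edges = sc.textFile(edgePath).map(convertToEdges)
    val verts = sc.textFile(vertexPath).map(f => (f.toLong, 1L))
    val myGraph = Graph(verts, edges, 1L)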

I usually use saveAsTextFile simply because I tend to use several programs to process the same data file, but it really depends on your file system.
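When text files are shared between several programs, it helps to fix the line format explicitly. Calling saveAsTextFile on the edges directly writes each edge's default toString form, such as `Edge(1,2,1)`, which is awkward to parse elsewhere. The sketch below shows one possible comma-separated format with a formatter and a tolerant parser. It is plain Scala with no Spark dependency: `EdgeRecord` and `EdgeLines` are hypothetical names, and `EdgeRecord` merely stands in for GraphX's `Edge[Long]`.

```scala
import scala.util.Try

// Hypothetical stand-in for GraphX's Edge[Long], so the sketch runs without Spark.
final case class EdgeRecord(src: Long, dst: Long, attr: Long)

object EdgeLines {
  // Render an edge as a "src,dst,attr" line.
  def format(e: EdgeRecord): String = s"${e.src},${e.dst},${e.attr}"

  // Parse a "src,dst,attr" line; malformed lines yield None instead of throwing.
  def parse(line: String): Option[EdgeRecord] =
    line.split(",").map(_.trim) match {
      case Array(s, d, a) => Try(EdgeRecord(s.toLong, d.toLong, a.toLong)).toOption
      case _              => None
    }
}
```

In a Spark job, the same two functions would typically be applied with `map` before saveAsTextFile and with `flatMap` after `sc.textFile`, so that malformed lines are dropped rather than failing the whole job.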
