I built a graph in Spark GraphX. This graph could grow to a billion nodes and more than 10 billion edges, so I do not want to build it again and again.
I want to be able to create it once, save it (I think it's best in HDFS), start some processes on it, and then access it after a couple of days or weeks, add new nodes and edges, and start a few more processes on it.
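For the "add new nodes and edges" part: GraphX graphs are immutable, so extending one means building a new graph from the union of the old RDDs and the new data. A minimal sketch of that idea, assuming vertex attributes of type `String` and edge attributes of type `Int` (the helper name and types here are illustrative, not part of the GraphX API):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Build a new graph that contains everything in `graph`
// plus the given extra vertices and edges.
def addToGraph(sc: SparkContext,
               graph: Graph[String, Int],
               newVertices: Seq[(VertexId, String)],
               newEdges: Seq[Edge[Int]]): Graph[String, Int] = {
  // union the old RDDs with the new data; the old graph is untouched
  val vertices = graph.vertices.union(sc.parallelize(newVertices))
  val edges    = graph.edges.union(sc.parallelize(newEdges))
  Graph(vertices, edges)
}
```

The old graph stays valid after this call; only the returned graph sees the new data.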
How can I do this in Apache Spark GraphX?
EDIT: I think I found a potential solution, but I would like someone to confirm that this is the best way.
If I have a graph, say graph, I can store its vertex RDD and its edge RDD separately as text files, and then access those files later. For example:
```scala
// the two RDDs must go to different paths, or the second save fails
graph.vertices.saveAsTextFile(verticesPath)
graph.edges.saveAsTextFile(edgesPath)
```
Now I have one question: should I use saveAsTextFile() or saveAsObjectFile()? And how should I load the graph back from these files later?
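A hedged sketch of the full round trip, using saveAsObjectFile() so the records come back as typed objects instead of strings that need re-parsing. The HDFS paths and the concrete attribute types (`String` vertices, `Int` edges) are assumptions for illustration:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object GraphPersistence {
  // placeholder paths; point these at your HDFS directories
  val verticesPath = "hdfs:///graphs/myGraph/vertices"
  val edgesPath    = "hdfs:///graphs/myGraph/edges"

  // Save the two RDDs that make up the graph as serialized objects,
  // so no text parsing is needed on reload.
  def save[VD, ED](graph: Graph[VD, ED]): Unit = {
    graph.vertices.saveAsObjectFile(verticesPath)
    graph.edges.saveAsObjectFile(edgesPath)
  }

  // Reload both RDDs with the types they were saved with,
  // then rebuild the graph from them.
  def load(sc: SparkContext): Graph[String, Int] = {
    val vertices = sc.objectFile[(VertexId, String)](verticesPath)
    val edges    = sc.objectFile[Edge[Int]](edgesPath)
    Graph(vertices, edges)
  }
}
```

With saveAsTextFile() you would instead get lines like `(1,someAttr)` back and have to parse them yourself on reload, which is why the object-file route is the simpler of the two here.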