Disjoint sets on apache sparks

Question

Disjoint sets on apache sparks

I am trying to find an algorithm for finding disjoint sets (connected components / union-search) on a large amount of data with apache curvature. The problem is the amount of data. Even the raw representation of the top of the graph is not suitable for use on a single machine. The edges also do not fit into the plunger.

The source data is a text file of graph graphs on hdf: "id1 \ t id2".

id is present as a string value, not int.

The naive solution I found is:

take rdd edges → [id1:id2] [id3:id4] [id1:id3]
group edges by key. → [id1:[id2;id3]][id3:[id4]]
for each record, the minimum identifier of each group → (flatMap) [id1:id1][id2:id1][id3:id1][id3:id3][id4:id3]
back rdd from stage 3 [id2:id1] -> [id1:id2]
leftOuterJoin from rdds from stages 3 and 4
repeat from step 2, while the rdd size in step 3 does not change.

But this leads to the transfer of large amounts of data between nodes (Shuffling)

Any tips?

+6

algorithm graph-theory mapreduce apache-spark disjoint-sets

Puh May 18, '16 at 10:39

source share

1 answer

Gevorg · Answer 1 · 2017-06-14T14:28:10+0000

If you work with charts, I would suggest you take a look at one of these libraries

They both provide an algorithm for connected components out of the box.

Graphx

 val graph: Graph = ... val cc = graph.connectedComponents().vertices

Graphframes

 val graph: GraphFrame = ... val cc = graph.connectedComponents.run() cc.select("id", "component").orderBy("component").show()

Disjoint sets on apache sparks

More articles: