Disjoint sets on Apache Spark

I am trying to find an algorithm for computing disjoint sets (connected components / union-find) on a large amount of data with Apache Spark. The problem is the volume of data: even the raw representation of the graph's vertices is too big for a single machine, and the edges do not fit in memory either.

The source data is a text file of graph edges on HDFS, one edge per line: "id1 \t id2".

Each id is a string value, not an int.
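For reference, a minimal sketch of loading such a file into an RDD of string pairs (the HDFS path and all names below are assumptions):

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder().appName("disjoint-sets").getOrCreate()
 val sc = spark.sparkContext

 // Hypothetical HDFS path; each line is "id1\tid2".
 val edges = sc.textFile("hdfs:///data/edges.tsv")
   .map(_.split("\t"))
   .map(parts => (parts(0), parts(1)))   // RDD[(String, String)]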

The naive solution I found (see the code sketch after the list) is:

  • take the RDD of edges → [id1:id2] [id3:id4] [id1:id3]
  • group the edges by key → [id1:[id2;id3]] [id3:[id4]]
  • for each group, emit the minimum identifier of the group for every member (including the key) → (flatMap) [id1:id1] [id2:id1] [id3:id1] [id3:id3] [id4:id3]
  • reverse the RDD from stage 3 → [id2:id1] becomes [id1:id2]
  • leftOuterJoin the RDDs from stages 3 and 4
  • repeat from step 2 until the size of the RDD in step 3 stops changing
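For concreteness, here is a minimal sketch of this kind of iterative min-label propagation with plain RDD operations. It uses a join to propagate labels rather than the groupByKey/leftOuterJoin wiring above, but the shuffle pattern is equivalent; `edges` is the RDD[(String, String)] from earlier and all names are illustrative:

 // Make the edge list symmetric so labels can flow in both directions.
 val symmetric = edges
   .flatMap { case (a, b) => Seq((a, b), (b, a)) }
   .cache()

 // Every vertex starts labelled with its own id.
 var labels = symmetric.keys.distinct().map(id => (id, id))

 var changed = 1L
 while (changed > 0) {
   // Push each vertex's current label to all of its neighbors,
   // then keep the minimum label seen per vertex.
   val propagated = symmetric
     .join(labels)                                  // (u, (v, label(u)))
     .map { case (_, (v, label)) => (v, label) }
     .union(labels)
     .reduceByKey((a, b) => if (a < b) a else b)

   // Convergence check: how many vertices changed label this round.
   // (Real code would also checkpoint periodically to keep the lineage short.)
   changed = labels.join(propagated)
     .filter { case (_, (oldL, newL)) => oldL != newL }
     .count()

   labels = propagated
 }
 // labels now maps every vertex id to the minimum id in its component.

Each iteration shuffles the full edge list, and the number of iterations grows with the diameter of the graph, which is exactly the traffic problem described below.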

But this leads to transferring large amounts of data between nodes (shuffling) on every iteration.

Any tips?

1 answer

If you are working with graphs, I would suggest you take a look at one of these libraries.

They both provide an algorithm for connected components out of the box.

GraphX

 val graph: Graph = ...
 val cc = graph.connectedComponents().vertices
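One practical wrinkle for the data in the question: GraphX vertex ids are Longs, so the string ids have to be mapped first. A sketch of one way to do that with zipWithUniqueId (all names below are assumptions, not part of the answer):

 import org.apache.spark.graphx.{Edge, Graph}

 // Assign a unique Long to every distinct string id.
 val ids = edges.flatMap { case (a, b) => Seq(a, b) }
   .distinct()
   .zipWithUniqueId()                               // (stringId, longId)

 // Translate both endpoints of every edge to their Long ids.
 val longEdges = edges
   .join(ids)                                       // (id1, (id2, long1))
   .map { case (_, (id2, long1)) => (id2, long1) }
   .join(ids)                                       // (id2, (long1, long2))
   .map { case (_, (long1, long2)) => Edge(long1, long2, 1) }

 val graph = Graph.fromEdges(longEdges, defaultValue = 1)
 val cc = graph.connectedComponents().vertices      // (longId, componentId)

Joining `cc` back to `ids` recovers the original string ids for each component if needed.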

GraphFrames

 val graph: GraphFrame = ...
 val cc = graph.connectedComponents.run()
 cc.select("id", "component").orderBy("component").show()
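GraphFrames, by contrast, accepts string ids directly, which fits the data format in the question. A minimal sketch of wiring it up from the edge pairs (the checkpoint directory is required by its connectedComponents implementation; the path and names are assumptions):

 import org.graphframes.GraphFrame
 import spark.implicits._

 // connectedComponents in GraphFrames needs a checkpoint directory.
 spark.sparkContext.setCheckpointDir("/tmp/graphframes-cc")  // hypothetical path

 // Column names "src"/"dst" and "id" are required by GraphFrame.
 val edgesDf = edges.toDF("src", "dst")
 val verticesDf = edgesDf.select($"src".as("id"))
   .union(edgesDf.select($"dst".as("id")))
   .distinct()

 val graph = GraphFrame(verticesDf, edgesDf)
 val cc = graph.connectedComponents.run()           // columns: "id", "component"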
