I am trying to find an algorithm for finding disjoint sets (connected components / union-search) on a large amount of data with apache curvature. The problem is the amount of data. Even the raw representation of the top of the graph is not suitable for use on a single machine. The edges also do not fit into the plunger.
The source data is a text file of graph graphs on hdf: "id1 \ t id2".
id is present as a string value, not int.
The naive solution I found is:
- take rdd edges →
[id1:id2] [id3:id4] [id1:id3] - group edges by key. →
[id1:[id2;id3]][id3:[id4]] - for each record, the minimum identifier of each group →
(flatMap) [id1:id1][id2:id1][id3:id1][id3:id3][id4:id3] - back rdd from stage 3
[id2:id1] -> [id1:id2] leftOuterJoin from rdds from stages 3 and 4- repeat from step 2, while the rdd size in step 3 does not change.
But this leads to the transfer of large amounts of data between nodes (Shuffling)
Any tips?
source share