Spark CollectAsMap

I would like to know how collectAsMap works in Spark. In particular, where is the data from all partitions aggregated? Aggregation happens either on the master or on the workers. In the first case, each worker sends its data to the master, and once the master has collected the data from every worker, the master aggregates the results. In the second case, the workers are responsible for combining the results (after exchanging data among themselves), and then the combined result is sent to the master.

It is very important for me to find out whether there is a way for the master to collect the data from each partition separately, without the workers exchanging data among themselves.

+7
distributed-computing apache-spark worker
2 answers

You can see how they implement collectAsMap here. Since the RDD type is a tuple, it looks like they just use the regular RDD collect and then convert the tuples into a map of key/value pairs. But they note in a comment that multimaps are not supported, so your data needs a 1-to-1 mapping between keys and values.
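That behavior can be sketched in plain Python (a conceptual illustration, not the actual Spark source, which is Scala): collectAsMap on a pair RDD is essentially collect followed by building a map.

```python
# Conceptual sketch: collectAsMap is collect() on a pair RDD plus a
# dict-build step. The "partitions" list stands in for the per-worker
# partition results that Spark fetches back to the driver.

def collect(partitions):
    """Stand-in for RDD.collect(): concatenate the results of all partitions."""
    results = []
    for part in partitions:       # in Spark, each partition's result
        results.extend(part)      # arrives at the driver from a worker
    return results

def collect_as_map(partitions):
    """Stand-in for PairRDD.collectAsMap(): collect, then build a map."""
    return dict(collect(partitions))

partitions = [[("a", 1), ("b", 2)], [("b", 3), ("c", 4)]]
print(collect(partitions))         # [('a', 1), ('b', 2), ('b', 3), ('c', 4)]
print(collect_as_map(partitions))  # {'a': 1, 'b': 3, 'c': 4}
```

Note how the key "b" appears twice in the collected array but only once in the map: a plain map cannot hold duplicate keys, which is exactly why the comment warns that multimaps are not supported.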

collectAsMap function

What collect does is run a Spark job, get the results back from each partition from the workers, and merge them with a reduce/concat phase on the driver.

collect function

So given that, it should be the case that the driver collects the data from each partition separately, and the workers do not need to exchange data with each other to execute collectAsMap .

Note that if you do transformations on your RDD before calling collectAsMap that cause a shuffle, there may be an intermediate step that forces the workers to exchange data with each other. Check your cluster manager's application UI to see more about how Spark is executing your application.
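The shuffle caveat can also be sketched in plain Python (again conceptual, not Spark source): a transformation such as reduceByKey must first regroup records so that all values for a key land in the same partition, and that regrouping is the worker-to-worker data exchange. Plain collect/collectAsMap never performs this step.

```python
# Conceptual sketch of a shuffle: records are hash-partitioned by key,
# which in a real cluster means workers exchanging data with each other.

def shuffle(partitions, num_partitions):
    """Redistribute every (key, value) record to a partition chosen by key."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_partitions].append((key, value))
    return out

def reduce_by_key(partitions, func):
    """Stand-in for RDD.reduceByKey(func): shuffle, then merge per partition."""
    shuffled = shuffle(partitions, len(partitions))
    result = []
    for part in shuffled:
        merged = {}
        for k, v in part:
            merged[k] = func(merged[k], v) if k in merged else v
        result.append(list(merged.items()))
    return result

# "a" occurs in two input partitions, so its records must cross partitions
# before they can be summed -- that crossing is the shuffle.
reduced = reduce_by_key([[("a", 1), ("b", 2)], [("a", 3)]], lambda x, y: x + y)
print(dict(kv for part in reduced for kv in part))  # {'a': 4, 'b': 2}
```

A subsequent collectAsMap on the reduced data would then only involve driver-side fetching, but the shuffle before it already moved data between workers.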

+6

First of all, in both operations, all the data present in the RDD is transferred from the various executors/workers to the Master/Driver. Both collect and collectAsMap simply gather the data from the executors/workers. That is why it is always recommended not to use collect unless you have no other option.

I must say that, performance-wise, collect should be the last option you consider.

  • collect : will return the results as an Array.
  • collectAsMap : will return the results for a pair RDD as a Map collection. And since it returns a Map, you will only get pairs with unique keys; pairs with duplicate keys will be dropped.
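The difference between the two return shapes can be seen with a small example (plain Python standing in for a two-element pair RDD; the `sc.parallelize` call in the comment is the assumed PySpark equivalent, not taken from this answer):

```python
# Hypothetical PySpark equivalent:  rdd = sc.parallelize([(1, "a"), (2, "b"), (1, "c")])
pairs = [(1, "a"), (2, "b"), (1, "c")]

as_array = list(pairs)  # what collect() returns: every pair is kept
as_map = dict(pairs)    # what collectAsMap() returns: duplicate keys collapse

print(as_array)  # [(1, 'a'), (2, 'b'), (1, 'c')]
print(as_map)    # {1: 'c', 2: 'b'} -- only one value per key survives
```

With duplicate keys, which value survives is an implementation detail of the map build; here the later pair (1, "c") overwrites the earlier (1, "a").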

Regards,

Neeraj

+1
