Understanding treeReduce() in Spark

You can see the implementation here: https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L804

How is it different from the "normal" reduce function?
What does depth = 2 mean?

I don't want the reducer function to walk the partitions linearly; instead it should first reduce each available pair, then repeat on the results until only one pair remains and that is reduced to a single value, as shown in the figure:

[Figure: elements combined pairwise, level by level, until a single value remains]

Is this what treeReduce does?
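To make the pattern concrete, here is a plain Python (non-Spark) sketch of the reduction order I have in mind (pairwise_reduce is just an illustrative name):

```python
from operator import add

def pairwise_reduce(f, xs):
    """Combine adjacent pairs level by level instead of folding left to right."""
    while len(xs) > 1:
        paired = [f(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:                 # carry an odd leftover element to the next round
            paired.append(xs[-1])
        xs = paired
    return xs[0]

print(pairwise_reduce(add, [1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```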

python reduce apache-spark pyspark rdd
1 answer

The standard reduce takes a wrapped version of the function and uses it with mapPartitions. After that, the per-partition results are collected and reduced locally on the driver. If the number of partitions is large and/or the function you use is expensive, this puts a significant load on a single machine.
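Conceptually it behaves roughly like the sketch below (a simplified model, not the actual implementation; reduce_partition is just an illustrative helper):

```python
from functools import reduce as local_reduce
from operator import add

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100), numSlices=8)

def reduce_partition(iterator):
    # Reduce each partition independently on the executors.
    values = list(iterator)
    if values:
        yield local_reduce(add, values)

# Collect one partial result per non-empty partition, then finish on the driver.
partials = rdd.mapPartitions(reduce_partition).collect()
result = local_reduce(add, partials)

print(result)           # 4950
print(rdd.reduce(add))  # same result via the real API
```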

The first phase of treeReduce is almost the same as above, but after that the partial results are merged in parallel, and only the final aggregation is performed on the driver.
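A minimal usage sketch (toy data; the value is identical to plain reduce, only the execution plan differs):

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100), numSlices=64)

# Per-partition reduction happens as before, but the 64 partial results are
# merged in distributed rounds; the driver only performs the small final step.
print(rdd.treeReduce(add, depth=2))  # 4950, same value as rdd.reduce(add)
```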

depth is the suggested depth of the tree, and since the depth of a node in a tree is defined as the number of edges between the root and that node, it should give you more or less the expected pattern, although in some cases the distributed aggregation may be stopped early.
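In the linked rdd.py, treeReduce is built on treeAggregate, which picks a scale factor of roughly numPartitions ** (1 / depth) per level and stops adding levels once they would no longer help. The hypothetical helper below only mimics that shrinking logic to show which partition counts to expect at each level; it is a back-of-the-envelope sketch, not Spark code:

```python
from math import ceil

def tree_levels(num_partitions, depth=2):
    """Rough sketch of how the partition count shrinks per aggregation level."""
    scale = max(int(ceil(num_partitions ** (1.0 / depth))), 2)
    levels = [num_partitions]
    # Stop early once another distributed level would not pay off.
    while num_partitions > scale + num_partitions / scale:
        num_partitions //= scale
        levels.append(num_partitions)
    return levels

print(tree_levels(64, depth=2))  # [64, 8]      -> 8 partials meet on the driver
print(tree_levels(64, depth=3))  # [64, 16, 4]  -> one more distributed round
```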

It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted at each level, and most likely more than two partitions will be merged at once.

Compared to the standard reduce, the tree-based version performs a reduceByKey on each iteration, which means a lot of data shuffling. If the number of partitions is relatively small, it will be much cheaper to use plain reduce. If you suspect that the final phase of reduce is a bottleneck, the tree* version may be worth trying.
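As a rough rule of thumb (toy numbers, assuming the default depth=2):

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

few = sc.parallelize(range(1000), numSlices=8)        # few partitions
many = sc.parallelize(range(100000), numSlices=2000)  # many partitions

# With 8 partitions the driver merges only 8 partials: plain reduce is the cheaper choice.
print(few.reduce(add))

# With 2000 partitions, plain reduce would ship 2000 partials to the driver;
# treeReduce first shrinks them in distributed rounds (paying for extra shuffles),
# so the driver only sees a few dozen values in the final step.
print(many.treeReduce(add, depth=2))
```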
