The standard reduce takes a wrapped version of the function and uses it with mapPartitions. After that, the partial results are collected and reduced locally on the driver. If the number of partitions is large and/or the function you use is expensive, this places a significant load on a single machine.
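A minimal sketch of that failure mode (the app name, partition count, and data here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("reduce-vs-treeReduce")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// 1000 partitions: each partition is reduced where it lives, but every
// partial result is then shipped to the driver, which merges them alone.
val rdd = sc.parallelize(1L to 10000000L, numSlices = 1000)
val total = rdd.reduce(_ + _)
```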
The first phase of treeReduce is pretty much the same as above, but afterwards the partial results are merged in parallel, and only the final aggregation is performed on the driver.
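Continuing the sketch above, the call site barely changes (depth = 2 is the default, spelled out here for clarity):

```scala
// Same computation, but partial results are now merged in parallel across
// the cluster; the driver only performs the last, much smaller aggregation.
val totalTree = rdd.treeReduce(_ + _, depth = 2)
```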
depth is the suggested depth of the tree, and since the depth of a node in a tree is defined as the number of edges between the root and that node, it should give you a more or less expected pattern, although in some cases the distributed aggregation may be stopped early.
It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted at each level and, most likely, more than two partitions will be merged at once.
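As an illustration, here is my paraphrase of the partition-scaling logic inside Spark's treeAggregate (a sketch, not the verbatim source): the branching factor is roughly the depth-th root of the partition count.

```scala
// Branching factor: roughly the depth-th root of the partition count,
// never less than 2 (paraphrased from RDD.treeAggregate's internals).
def branchingFactor(numPartitions: Int, depth: Int): Int =
  math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)

println(branchingFactor(1000, 2)) // 32 -- far more than 2 partitions merged per level
println(branchingFactor(1000, 3)) // 10
```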
Compared to the standard reduce, the tree version performs a reduceByKey with each iteration, and that means a lot of data shuffling. If the number of partitions is relatively small, it will be much cheaper to use plain reduce. If you suspect that the final phase of reduce is a bottleneck, the tree* version is worth a try.
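Since the trade-off depends on your workload, the simplest check is to time both on your own data. A rough sketch, reusing the rdd defined above:

```scala
// Crude wall-clock timing helper; good enough to spot an order-of-magnitude gap.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

time("reduce")(rdd.reduce(_ + _))
time("treeReduce")(rdd.treeReduce(_ + _, depth = 2))
```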
zero323