The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the reducer collects data from each mapper. This can happen while the mappers are still generating data, since it is only a data transfer. Sort and reduce, on the other hand, can only start once all the mappers are done. You can tell which step MapReduce is in by looking at the reducer completion percentage: 0-33% means it is shuffling, 34-66% is sort, 67-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they are waiting for the mappers to finish.
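As a rough illustration of the percentage ranges above, here is a minimal sketch that maps a reducer's reported progress to the step it is in. The function name `reduce_step` is made up for this example and is not part of Hadoop; treating the boundaries as exact thirds for fractional values is also an assumption.

```python
def reduce_step(progress_pct: float) -> str:
    """Return the reduce-phase step implied by a reducer's progress (0-100).

    Hypothetical helper, not a Hadoop API. Boundaries are assumed to sit
    at one third and two thirds, which matches the 0-33 / 34-66 / 67-100
    ranges for whole-number percentages.
    """
    if not 0 <= progress_pct <= 100:
        raise ValueError("progress must be between 0 and 100")
    if progress_pct < 100 / 3:
        return "shuffle"  # still copying map output
    if progress_pct < 200 / 3:
        return "sort"     # merging the copied map output
    return "reduce"       # running the user's reduce function


if __name__ == "__main__":
    for pct in (10, 33, 34, 66, 67, 100):
        print(pct, reduce_step(pct))
```

A reducer that sits at 33% for a long time is, by this reading, still in shuffle and waiting on the remaining mappers.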
Reducers start shuffling based on a threshold: the percentage of mappers that have finished. You can change this parameter to make the reducers start sooner or later.
Why is starting the reducers early a good thing? Because it spreads the data transfer from the mappers to the reducers out over time, which helps if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for the mappers to finish. Another job that starts later and would actually use those reduce slots can't get them.
You can customize when the reducers start by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In newer versions of Hadoop (at least 2.4.1), the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks to user yegor256).
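To make the effect of the threshold concrete, here is a small sketch of the rule as described: reducers become eligible to start once the fraction of completed mappers reaches the slow-start value. The function `reducers_may_start` is a hypothetical name for illustration only; the real scheduler's comparison and rounding may differ.

```python
def reducers_may_start(completed_maps: int, total_maps: int, slowstart: float) -> bool:
    """Model the slow-start rule described above (not Hadoop's actual code).

    slowstart plays the role of mapred.reduce.slowstart.completed.maps:
    the fraction of map tasks that must finish before reducers start.
    """
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0.0 and 1.0")
    if total_maps <= 0:
        # Assumption: with no map tasks there is nothing to wait for.
        return True
    return completed_maps / total_maps >= slowstart


if __name__ == "__main__":
    total = 100
    for threshold in (0.0, 0.5, 1.0):
        # Smallest number of finished mappers at which reducers may start.
        first = next(n for n in range(total + 1)
                     if reducers_may_start(n, total, threshold))
        print(f"slowstart={threshold}: reducers may start after {first} of {total} maps")
```

With 100 map tasks, a threshold of 0.0 lets reducers start immediately, 0.5 holds them until 50 maps are done, and 1.0 holds them until all 100 are done.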
Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. That way a job doesn't hog reduce slots while its reducers are doing nothing but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.
Donald Miner, Jul 26 '12 at 16:27