When do reduce tasks begin in Hadoop?

In Hadoop, when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers have completed? If so, how is this threshold set? What threshold is commonly used?

+68
mapreduce reduce hadoop
Jul 26 '12 at 15:25
8 answers

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are still generating data, since it is only a data transfer. Sort and reduce, on the other hand, can only start after all mappers are done. You can tell which step MapReduce is in by looking at the reduce completion percentage: 0-33% means it is shuffling, 34-66% sorting, 67-100% reducing. This is why your reducers sometimes seem "stuck" at 33%: they are waiting for the mappers to finish.

Reducers start shuffling based on a threshold percentage of mappers that have completed. You can change this parameter to make reducers start sooner or later.

Why start reducers early? Because it spreads the data transfer from the mappers to the reducers out over time, which is good if your network is the bottleneck.

Why is starting reducers early bad? Because they "hog up" reduce slots while only copying data and waiting for the mappers to finish. Another job that starts later and would actually use the reduce slots cannot use them.

You can customize when reducers start by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers immediately. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a per-job basis. In newer versions of Hadoop (at least 2.4.1), the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks to user yegor256).

Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. That way the job doesn't hog reducers while they are doing nothing but copying data. If you only ever run one job at a time, 0.1 would probably be appropriate.
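For example, a cluster-wide default could be set in mapred-site.xml like this (using the newer property name mentioned above; the 0.80 value is only an illustration):

```xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.80</value>
  <description>Start reducers' shuffle after 80% of map tasks have completed.</description>
</property>
```

To change it per job instead, the same property can be passed on the command line, e.g. `hadoop jar myjob.jar MyJob -D mapreduce.job.reduce.slowstart.completedmaps=0.80 input output` (the jar and class names are placeholders; this assumes the job's driver parses generic options via ToolRunner).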

+179
Jul 26 '12 at 16:27

The reduce phase can begin long before a reducer is actually called. As soon as a mapper finishes, the generated data undergoes some sorting and shuffling (which includes calls to the combiner and the partitioner). The reducer "phase" kicks in the moment post-mapper data processing starts. As this processing proceeds, you will see progress in the reducer percentage. However, no actual reducer has been called yet. Depending on the number of processors available/in use, the nature of the data, and the number of expected reducers, you may want to change the parameter as described by Donald Miner above.

+3
Dec 18 '13 at 0:56

As far as I understand, the reduce phase starts alongside the map phase and keeps consuming records from the maps. However, since there is a sort and shuffle phase after the map phase, all outputs have to be sorted and sent to the reducer. So logically you can imagine that the reduce phase starts only after the map phase, but in practice, for performance reasons, reducers are initialized while the mappers are still running.

+1
Dec 12 '18

When the Mappers complete their task, the Reducer begins its work of reducing the data; that is how a MapReduce job works.

+1
Oct 30 '16 at 9:02

The percentage shown for the reduce phase is actually the amount of data copied from the mappers' output to the reducers' input directories. Do you know when this copying starts? It is a configuration you can set, as Donald showed above. Once all the data is copied to the reducers (i.e., 100% reduce), that is when the reducers actually start working, and hence they can freeze at "100% reduce" if the reducer code is I/O- or CPU-intensive.

0
Sep 30 '13 at 16:46

Consider the WordCount example to better understand how a MapReduce task works. Suppose we have a large file, say a novel, and our task is to find the number of times each word occurs in the file. Since the file is large, it may be divided into different blocks and replicated across different worker nodes. The word-count job is composed of map and reduce tasks. The map task takes each block as input and produces intermediate key-value pairs. In this example, since we are counting occurrences of words, the mapper, while processing a block, would give intermediate results of the form (word1, count1), (word2, count2), etc. The intermediate results of all the mappers then pass through a shuffle phase, which reorders the intermediate results.

Suppose that our map output from different mappers is of the following form:

Map 1: (This, 24) (Was, 32) (And, 12)

Map 2: (Mine, 12) (This, 23) (Was, 30)

The map outputs are sorted so that the same key values go to the same reducer. Here that means the pairs for "This", "Was", etc. go to the same reducer. It is the reducer that produces the final output, which in this case would be: (And, 12) (This, 47) (Mine, 12) (Was, 62)
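The map, shuffle, and reduce steps above can be sketched in a few lines of plain Python (a toy illustration of the data flow, not Hadoop API code):

```python
from collections import defaultdict

def map_block(block_text):
    """Map task: emit a (word, 1) pair for every word in this block."""
    return [(word, 1) for word in block_text.split()]

def shuffle(all_map_outputs):
    """Shuffle: group every count emitted for the same key together,
    so that each key ends up at exactly one reducer."""
    groups = defaultdict(list)
    for word, count in all_map_outputs:
        groups[word].append(count)
    return groups

def reduce_words(groups):
    """Reduce task: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}
```

Feeding in the two map outputs above, `shuffle` groups (This, 24) and (This, 23) together, and `reduce_words` yields This: 47, Was: 62, And: 12, Mine: 12.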

0
Apr 14 '14 at 9:25

The reducer tasks start only after the completion of all the mappers.

But the data transfer happens after each map. It is actually a pull operation.

That means each reducer will keep asking every mapper whether it has some data to retrieve. If it finds that any mapper has completed its task, the reducer pulls that mapper's intermediate data.

The intermediate data from a Mapper is stored on disk. And the data transfer from Mapper to Reducer happens over the network (data locality is not preserved in the reduce phase).
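A rough sketch of that pull behavior in Python (the `MapTask` class and its attributes are invented here purely to illustrate the polling loop; real Hadoop reducers fetch map output over HTTP from the task trackers/node managers):

```python
import time

class MapTask:
    """Stand-in for a running map task (illustrative only)."""
    def __init__(self, task_id, output):
        self.task_id = task_id
        self._output = output
        self.finished = False  # flipped to True when the map task completes

    def intermediate_output(self):
        return self._output

def shuffle_fetch(map_tasks, poll_interval=0.01):
    """Reducer-side copy phase: repeatedly poll the map tasks and
    pull the intermediate output from each one as soon as it finishes."""
    fetched = {}
    pending = set(map_tasks)
    while pending:
        for task in list(pending):
            if task.finished:
                fetched[task.task_id] = task.intermediate_output()
                pending.remove(task)
        if pending:
            time.sleep(poll_interval)  # wait before asking again
    return fetched
```

The point of the sketch is the polling structure: the reducer does not wait for the whole map phase before copying, but it cannot run its reduce function until every map's output has been pulled.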

0
Apr 20 '14 at 8:30

The reduce function runs only after all the mappers have finished their task: the Reducer has to communicate with all the mappers, so it has to wait until the last mapper finishes. However, data starts getting transferred as each mapper completes its own task.

0
Nov 25 '14 at 6:45


