I don't think Hadoop can map the same key to multiple reducers. The keys can, however, be partitioned so that the reducers are more or less evenly loaded. To do this, the input data must be sampled and the keys partitioned accordingly. For more information on the custom partitioner, see the Yahoo paper. Yahoo's sort code is in the org.apache.hadoop.examples.terasort package.
Suppose key A has 10 rows, B has 20 rows, C has 30 rows, and D has 60 rows in the input. Then keys A, B, and C can be sent to reducer 1 and key D to reducer 2, so that each reducer handles 60 rows and the load is evenly distributed. To partition the keys this way, you need to sample the input to learn how the keys are distributed.
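The A/B/C/D example can be worked through in plain Java. This is only a sketch of the idea, not Hadoop code and not what the terasort partitioner does: the class name KeyLoadBalancer and its methods are made up for illustration, and the greedy "heaviest key first, least-loaded reducer" rule is just one simple way to balance the counts you would get from sampling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: balances keys across reducers
// using per-key row counts (e.g. estimated by sampling the input).
class KeyLoadBalancer {

    /** Assigns each key to a reducer index (0-based), heaviest keys first. */
    public static Map<String, Integer> assign(Map<String, Long> keyCounts, int numReducers) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(keyCounts.entrySet());
        // Heaviest keys first; ties broken by key name so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        long[] load = new long[numReducers];
        Map<String, Integer> assignment = new TreeMap<>();
        for (Map.Entry<String, Long> e : entries) {
            // Pick the reducer with the smallest load so far (lowest index on ties).
            int target = 0;
            for (int r = 1; r < numReducers; r++) {
                if (load[r] < load[target]) {
                    target = r;
                }
            }
            load[target] += e.getValue();
            assignment.put(e.getKey(), target);
        }
        return assignment;
    }

    /** Total number of rows each reducer would receive under the given assignment. */
    public static long[] loads(Map<String, Long> keyCounts, Map<String, Integer> assignment,
                               int numReducers) {
        long[] load = new long[numReducers];
        for (Map.Entry<String, Integer> e : assignment.entrySet()) {
            load[e.getValue()] += keyCounts.get(e.getKey());
        }
        return load;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("A", 10L);
        counts.put("B", 20L);
        counts.put("C", 30L);
        counts.put("D", 60L);

        Map<String, Integer> assignment = assign(counts, 2);
        System.out.println(assignment);
        System.out.println(Arrays.toString(loads(counts, assignment, 2)));
    }
}
```

With these counts, D ends up alone on one reducer and A, B, C share the other, 60 rows each. In a real job the resulting key-to-reducer mapping would have to be applied by a custom Partitioner set on the job.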
Here are some more tips to make the job finish faster.
Set a Combiner on the JobConf to reduce the number of key/value pairs sent to the reducers. This also reduces the network traffic between the map and reduce tasks. Note, however, that there is no guarantee the Hadoop framework will invoke the combiner.
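To see why a combiner helps, and why it is safe even if the framework skips it, here is a small plain-Java simulation. It is not Hadoop code: CombinerSketch and its methods are hypothetical names, and it assumes a word-count style job where values are summed per key. In an actual job the combiner would be registered with JobConf.setCombinerClass(...).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation, not part of Hadoop: shows the effect of local
// aggregation on the number of records that leave a map task.
class CombinerSketch {

    /** Emits one (word, 1) record per word, as a word-count mapper would. */
    public static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Long>(word, 1L));
            }
        }
        return out;
    }

    /** Sums the values per key; the same logic serves as combiner and reducer. */
    public static Map<String, Long> sumByKey(List<Map.Entry<String, Long>> records) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> r : records) {
            totals.merge(r.getKey(), r.getValue(), Long::sum);
        }
        return totals;
    }

    /** Combiner step: collapses the map output to one record per distinct key. */
    public static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> mapOutput) {
        return new ArrayList<>(sumByKey(mapOutput).entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> raw = map("tools tools tools data tools");
        List<Map.Entry<String, Long>> combined = combine(raw);

        System.out.println("records without combiner: " + raw.size());
        System.out.println("records with combiner:    " + combined.size());
        // The reducer result is the same either way, which is why it does not
        // matter whether the framework actually runs the combiner.
        System.out.println(sumByKey(raw).equals(sumByKey(combined)));
    }
}
```

Because summing is associative and commutative, the reducer produces the same totals from the raw and the combined records. An operation without that property (a plain average, for instance) cannot simply be reused as a combiner.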
In addition, since the data is skewed (some keys are repeated over and over, say, “tools”), you may want to increase the number of reduce tasks so that the job completes faster. That way, while one reducer processes “tools”, the rest of the data is processed in parallel by the other reducers.
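The effect of adding reduce tasks can be estimated with a rough sketch like the one below. It is plain Java, not Hadoop code: ReducerCountSketch is a made-up name, and it applies the formula used by Hadoop's default HashPartitioner to String.hashCode(), whereas a real job hashes its own key type (e.g. Text), so the exact placement of keys will differ. In a real job the reducer count is set with JobConf.setNumReduceTasks(int).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation, not part of Hadoop: estimates rows per reducer
// for a skewed key distribution under hash partitioning.
class ReducerCountSketch {

    /** Same formula as the default HashPartitioner, applied to String.hashCode(). */
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Number of rows each reducer would receive for the given per-key counts. */
    public static long[] loads(Map<String, Long> keyCounts, int numReduceTasks) {
        long[] load = new long[numReduceTasks];
        for (Map.Entry<String, Long> e : keyCounts.entrySet()) {
            load[partition(e.getKey(), numReduceTasks)] += e.getValue();
        }
        return load;
    }

    public static void main(String[] args) {
        // Illustrative counts only: one hot key plus 26 small ones.
        Map<String, Long> counts = new HashMap<>();
        counts.put("tools", 1000L);
        for (char c = 'a'; c <= 'z'; c++) {
            counts.put(String.valueOf(c), 10L);
        }
        System.out.println(Arrays.toString(loads(counts, 2)));
        System.out.println(Arrays.toString(loads(counts, 8)));
    }
}
```

Whatever the reducer count, all rows for “tools” still go to a single reducer, so that reducer's load never drops below the hot key's row count. Extra reducers only take the remaining keys off its hands.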