Custom Partitioner Example

I am trying to write a new Hadoop job for input data that is somewhat skewed. An analogy for this would be the word count example from the Hadoop tutorial, except that one particular word appears many, many times.

I would like a partition function where this one key is mapped to several reducers, while the remaining keys are distributed according to their usual hash partitioning. Is this possible?

Thanks in advance.

+4
2 answers

I do not think that in Hadoop the same key can be mapped to multiple reducers. But the keys can be partitioned so that the reducers are more or less evenly loaded. For this, the input data should be sampled and the keys partitioned accordingly. See the Yahoo paper for more details on the custom partitioner; Yahoo's sort code is in the org.apache.hadoop.examples.terasort package.

Suppose key A has 10 rows of input, B has 20 rows, C has 30 rows, and D has 60 rows. Then keys A, B, and C can be sent to reducer 1 and key D to reducer 2, distributing the load evenly across the reducers. To partition the keys this way, you have to sample the input to know how the keys are distributed.
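As a minimal sketch of what such a partitioner could look like (old mapred API, to match the JobConf mentioned below; the class name, the hard-coded heavy key "D", and the "last reducer" choice are assumptions made for illustration, and in a real job the heavy keys would be discovered by sampling the input):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner for the example above: the heavy key "D" gets the
// last reducer to itself, all other keys are hashed over the remaining ones.
public class SkewAwarePartitioner implements Partitioner<Text, LongWritable> {

    private static final String HEAVY_KEY = "D"; // assumed; found by sampling

    @Override
    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        if (HEAVY_KEY.equals(key.toString())) {
            return numPartitions - 1;            // reducer reserved for "D"
        }
        // Hash A, B, C, ... over the remaining reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```

It would be registered in the driver with conf.setPartitionerClass(SkewAwarePartitioner.class).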

Here are a couple more tips to speed up your job (a driver sketch combining them follows below).

Specify a Combiner via JobConf to reduce the number of keys sent to the reducers. This also cuts the network traffic between the map and reduce tasks. Note, however, that there is no guarantee the Hadoop framework will actually invoke the combiner.

Also, since the data is skewed (some keys are repeated over and over, say "tools"), you may want to increase the number of reduce tasks so the job finishes sooner. This ensures that while one reducer is busy processing "tools", the rest of the data is being processed in parallel by the other reducers.
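A small driver sketch wiring both tips into a JobConf (old mapred API; the class name, the argument-based paths, and the figure of 8 reduce tasks are illustrative choices, not from the original answer). It reuses the library word-count mapper and reducer that ship with the old API:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class SkewedWordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SkewedWordCount.class);
        conf.setJobName("skewed-wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // Library word-count mapper/reducer from the old mapred API.
        conf.setMapperClass(TokenCountMapper.class);
        conf.setReducerClass(LongSumReducer.class);

        // Tip 1: reuse the reducer as a combiner so fewer (word, count) pairs
        // cross the network between the map and reduce tasks.
        conf.setCombinerClass(LongSumReducer.class);

        // Tip 2: more reduce tasks, so the reducer stuck with the heavy key
        // does not hold up the processing of everything else.
        conf.setNumReduceTasks(8);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```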

+5

If you split the data across multiple reducers for performance reasons, you will need a second reducer to aggregate the partial results into the final result set.

Hadoop has a built-in feature that does something along these lines: the combiner.

The combiner provides "reducer-like" functionality: within the scope of a map task, a partial reduce can be performed on the data, which reduces the number of records that have to be processed later.

In the basic word count example, the combiner is exactly the same as the reducer. Note that for some algorithms you will need different implementations for the two. I have also had a project where a combiner was not possible at all because of the algorithm.
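For reference, a minimal sketch of that word count case (old mapred API): summing is associative and commutative, so the same class can be registered both as the reducer and as the combiner. Something like computing an average would not work this way, since the combiner would have to emit partial sums and counts rather than averages.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Classic word count reducer: partial sums produced by the combiner on the
// map side are simply summed again on the reduce side.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```

It is wired in with conf.setReducerClass(SumReducer.class) and conf.setCombinerClass(SumReducer.class), just as in the word count tutorial.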

+1
