Hadoop - How does the reducer receive data?

I understand that each mapper produces 1 partition per reducer. How does the reducer know which partition to copy? Suppose there are 2 nodes running mappers for a word count program, and there are 2 reducers. If each map node creates 2 partitions, and partitions on both nodes may contain the same word as a key, how will the reducer work correctly?

For example:

Say node 1 creates partition 1 and partition 2, and partition 1 contains a key named "WHO".

Say node 2 creates partition 3 and partition 4, and partition 3 contains a key named "WHO".

If partition 1 and partition 4 went to reducer 1 (and the remaining ones to reducer 2), how would reducer 1 compute the correct word count?

If this is not possible, and partitions 1 and 3 are made to go to reducer 1, how does Hadoop do it? Does it ensure that a given key-value pair from different nodes always goes to the same reducer? If so, how is this done?

Thanks, Suresh.

1 answer

In your situation, since partition 1 and partition 3 both contain the key "WHO", it is guaranteed that these two partitions go to the same reducer.

Update

In Hadoop, the maximum number of reduce tasks that a tasktracker will run at any one time is determined by the mapred.tasktracker.reduce.tasks.maximum property.
The number of reducers for a MapReduce job is set via -D mapred.reduce.tasks=n

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. Each partition can hold many keys (and their associated values), but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner is used, which buckets keys using a hash function. (Hadoop: The Definitive Guide)
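To make the hash bucketing concrete, here is a minimal sketch in Python. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch mirrors that formula. The function names are hypothetical, and Java's String hash over ASCII text is used as a stand-in for the key's `hashCode()`, so the actual partition numbers on a real cluster (where keys are usually Text objects) may differ.

```python
def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() for ASCII text (assumption: not Text.hashCode())."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the value within 32 bits
    # Reinterpret the 32-bit value as a signed int, as Java would.
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(key: str, num_reduce_tasks: int) -> int:
    """Same shape as HashPartitioner: clear the sign bit, then take the modulus."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


if __name__ == "__main__":
    # The result depends only on the key and the reducer count,
    # not on which node the mapper ran on.
    for word in ["WHO", "is", "there"]:
        print(word, "->", partition_for(word, 2))
```

Because the function takes only the key and the number of reducers as input, every mapper computes the same partition number for "WHO", regardless of the node it runs on.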

So, the values for a given key always go to the same reducer.
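The scenario from the question can be simulated end to end as a rough sketch. This is not Hadoop code: the helper names and the sample input lines are made up for illustration, and `zlib.crc32` stands in for the key hash only because it is deterministic across runs. Each simulated map node writes one partition per reducer, and each reducer pulls its own partition from every node.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    # Stand-in hash partitioner (assumption: crc32 instead of key.hashCode()).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map(lines, num_reducers):
    """One map node: emit (word, 1) into one partition per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for line in lines:
        for word in line.split():
            partitions[partition_for(word, num_reducers)].append((word, 1))
    return partitions


def run_reduce(reducer_id, map_outputs):
    """One reducer: copy partition `reducer_id` from every map node and sum counts."""
    counts = defaultdict(int)
    for partitions in map_outputs:
        for word, n in partitions[reducer_id]:
            counts[word] += n
    return dict(counts)


NUM_REDUCERS = 2
node1_lines = ["WHO is there", "WHO knows"]  # hypothetical input on node 1
node2_lines = ["WHO asked"]                  # hypothetical input on node 2

map_outputs = [run_map(node1_lines, NUM_REDUCERS), run_map(node2_lines, NUM_REDUCERS)]
results = [run_reduce(r, map_outputs) for r in range(NUM_REDUCERS)]

if __name__ == "__main__":
    for reducer_id, counts in enumerate(results):
        print("reducer", reducer_id, counts)
```

Whichever reducer ends up owning "WHO", it sees the records from both nodes, so its count is complete and the other reducer never sees that key.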


Source: https://habr.com/ru/post/1411682/