MapReduce - What is the use of a word count example?

I am trying to understand the benefits of MapReduce. I have just read some introductions to it for the first time.

They all use the canonical example of counting words in a large set of documents, but I see no benefit. Here is my current understanding; correct me if I am wrong.

We specify a list of input files (documents). The MapReduce library takes this list and divides it among the processors in the cluster. Each document on a processor is passed to a map function, which in this case returns a list of (word, count) pairs.

This is where I'm a little unsure about what exactly happens. The library software then looks through the results on all the different processors and groups together the pairs that share the same word (key). These groups are gathered on different processors, and reduce is called for each group on its processor.

Combined results are then collected on the main node.

Is this the correct interpretation?
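
The flow described above can be sketched in a single process. This is only a minimal illustration, not how any real MapReduce library is implemented: the function names `map_words`, `shuffle`, `reduce_counts` and `word_count` are made up for this example, and the "cluster" is just a Python list of documents.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: called once per key, sums the counts for that word."""
    return word, sum(counts)


def word_count(documents):
    # In a real cluster each document would be mapped on a different node;
    # here the map calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_words(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["a sample text", "another sample"]))
```

The map calls are independent of each other, and so are the reduce calls for different keys, which is what lets a library run them on separate machines.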

What I do not understand is this: since all the results have to be sorted in order to group the keys, why not just count the keys it finds at the same time? Why is reduce necessary? How does this process save time when it seems to take a lot of work to find and combine the common keys?

1 answer

Here is a good YouTube video on the MapReduce algorithm. If you watch the full series of 5 videos, it will give you much more clarity on MapReduce and answer most of your questions.

What I do not understand is this: since all the results have to be sorted in order to group the keys, why not just count the keys it finds at the same time? Why is reduce necessary? How does this process save time when it seems to take a lot of work to find and combine the common keys?

Since key/value pairs for a specific word, such as “sample” in the word count example, can be emitted by different map tasks running on different nodes, these key/value pairs must be merged and sorted before being sent to the reduce task. The reduce task for a specific key runs on a single node and is not distributed.
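
One way to picture how every pair for a given key ends up on a single reducer is hash partitioning. The sketch below is an assumption-laden illustration: `partition` and `route` are hypothetical names, `zlib.crc32` is used only because it gives a stable hash across runs, and the number of reducers is arbitrary.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(map_outputs, num_reducers):
    """Send each (key, value) pair from every map task to its reducer's bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for output in map_outputs:          # one list of pairs per map task
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    # Two map tasks on different nodes both emitted pairs for "sample".
    mapper_a = [("sample", 1), ("text", 1)]
    mapper_b = [("sample", 1), ("another", 1)]
    for index, bucket in enumerate(route([mapper_a, mapper_b], 3)):
        print(index, bucket)
```

Because the reducer index depends only on the key, both "sample" pairs land in the same bucket no matter which map task produced them.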

FYI, the output of a map task is combined by a combiner class (which, for word count, is the same as the reducer class) on the same node as the map task, to reduce network traffic between mappers and reducers.
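
As a rough sketch of what a combiner buys you, the snippet below pre-sums the pairs from one map task before they would leave the node. The names `map_words` and `combine` are hypothetical, and a real framework decides for itself when and how often to run the combiner.

```python
from collections import Counter


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def combine(pairs):
    """Combiner: sum counts per word locally, on the same node as the map task."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


if __name__ == "__main__":
    raw = map_words("sample text sample sample")
    combined = combine(raw)
    # Fewer pairs cross the network after combining.
    print(len(raw), "pairs before,", len(combined), "pairs after")
```

This only works because addition is associative and commutative, so summing partial sums gives the same total as summing every individual 1 in the reducer.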
