I am trying to understand the benefits of MapReduce; I have just read some introductions to it for the first time.
They all use the canonical example of counting words in a large set of documents, but I don't see the benefit. The following is my current understanding; correct me if I am wrong.
We specify a list of input files (documents). The MapReduce library takes this list and distributes the files across the processors in the cluster. Each document on a processor is passed to a map function, which in this case returns a list of (word, 1) pairs.
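For concreteness, here is a sketch of what I understand the map function to be doing (in Python; the function name and signature are just illustrative, not from any particular MapReduce library):

```python
def map_fn(doc_name: str, doc_contents: str) -> list[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in doc_contents.split()]

# map_fn("d1", "the cat sat on the mat")
# -> [("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1)]
```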
This is where I'm a little unsure exactly what happens. The library then gathers the intermediate results from all the processors and groups together the pairs that share the same word (key). The groups are distributed to different processors, and a reduce function is called once for each group on its processor.
The combined results are then collected on the master node.
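If my reading is right, the grouping and reduce steps would look roughly like this (again just an illustrative single-process sketch of my understanding; the real library presumably does this across machines):

```python
from collections import defaultdict

def shuffle(mapped_pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_fn(word: str, counts: list[int]) -> tuple[str, int]:
    """Called once per group; sums the partial counts for one word."""
    return (word, sum(counts))
```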
Is this the correct interpretation?
What I do not understand: since all the intermediate results have to be sorted and shuffled in order to group them by key, why not just count each key as the mapper encounters it? Why is the grouping step necessary at all? How does this process save time, when it seems like a lot of work to find and combine the common keys?
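To be explicit, the alternative I have in mind is simply keeping a running tally in a hash map, something like:

```python
from collections import Counter

def count_directly(doc_contents: str) -> Counter:
    """Tally each word in a hash map as it is encountered,
    with no separate sort/shuffle/group step."""
    return Counter(doc_contents.split())
```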