Reducers always run in parallel, whether or not you use streaming (if you don't see this, make sure the job is configured with several reduce tasks; see mapred.reduce.tasks in your cluster or job configuration). The difference is that things are packaged up a little more nicely for you when you use Java instead of streaming.
With Java, the reduce task gets an iterator over all the values for a given key. That makes it easy to walk the values if you are, say, summing the map output in your reduce task. In streaming mode, you literally just get a stream of key-value pairs. You are guaranteed that the values are sorted by key, and that all the values for a given key go to a single reduce task, but any state tracking is up to you. For example, in Java the output of your map arrives at your reducer conceptually in the form
key1, {val1, val2, val3}
key2, {val7, val8}
When streaming, your result looks like
key1, val1
key1, val2
key1, val3
key2, val7
key2, val8
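To make the two views concrete, here is a small sketch (the pair values and the helper name `grouped` are illustrative, not part of any Hadoop API) showing how the flat, key-sorted streaming view can be regrouped into the Java-style "key with all its values" view:

```python
import itertools

def grouped(pairs):
    """Regroup a key-sorted stream of (key, value) pairs into
    (key, [values]) tuples, i.e. the Java-style iterator view."""
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

# The streaming view: one pair at a time, already sorted by key.
stream = [("key1", "val1"), ("key1", "val2"), ("key1", "val3"),
          ("key2", "val7"), ("key2", "val8")]

for key, values in grouped(stream):
    print(key, values)
```

Note that `itertools.groupby` only works here because the stream is sorted by key, which is exactly the guarantee the framework gives you.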
For example, to write a reducer that computes the sum of the values for each key, you need a variable to store the last key you saw and a variable to store the running sum. Each time you read a new key-value pair, you do the following:
- check whether the key differs from the last key you saw.
- if it does, print the last key and the running sum, and reset the sum to zero.
- add the current value to the sum and set the last key to the current key.
- when the input is exhausted, print the final key and sum.
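The steps above can be sketched as a streaming-style reducer. This is a minimal sketch, assuming the usual Hadoop streaming convention of tab-separated `key<TAB>value` lines on stdin; the function name `sum_reducer` and the demo input are my own, not from the original answer:

```python
import sys

def sum_reducer(lines):
    """Sum the values for each key in a key-sorted stream of
    'key<TAB>value' lines, yielding one 'key<TAB>sum' line per key."""
    last_key, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != last_key:
            # New key: flush the previous key's sum and reset.
            if last_key is not None:
                yield f"{last_key}\t{total}"
            total = 0
        total += int(value)
        last_key = key
    # Input exhausted: flush the final key.
    if last_key is not None:
        yield f"{last_key}\t{total}"

if __name__ == "__main__":
    # In a real job this would read sys.stdin; a small demo input here.
    demo = ["key1\t1\n", "key1\t2\n", "key1\t3\n", "key2\t7\n", "key2\t8\n"]
    for out in sum_reducer(demo):
        print(out)
```

In an actual streaming job you would pass `sys.stdin` to `sum_reducer` and register the script with `-reducer`.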
HTH.