Parallel Ruby Reducers in Hadoop?

A simple wordcount reducer in Ruby looks like this:

    #!/usr/bin/env ruby
    wordcount = Hash.new
    STDIN.each_line do |line|
      keyval = line.split("|")
      wordcount[keyval[0]] = wordcount[keyval[0]].to_i + keyval[1].to_i
    end
    wordcount.each_pair do |word, count|
      puts "#{word}|#{count}"
    end

It receives on stdin the intermediate output of all mappers, not the values for one specific key. So in effect there is only one reducer for everything (not a reducer per word or per set of words).

However, in Java examples I saw this interface, which receives the key and a list of values as input. This means the intermediate map outputs are grouped by key before the reduce step, so the reducers can work in parallel:

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Is this just a Java feature, or can I do the same thing with Hadoop Streaming in Ruby?

+4
2 answers

Reducers will always run in parallel, whether you use streaming or not (if you are not seeing this, make sure the job configuration sets multiple reduce tasks - see mapred.reduce.tasks in your cluster or job configuration). The difference is that Java packages things up a little more nicely for you than streaming does.

In Java, the reduce task gets an iterator over all the values for a particular key. This makes it easy to walk the values if, say, you are summing the map output in your reduce task. In streaming, you literally just get a stream of key-value pairs. You are guaranteed that the values will be sorted by key, and that the values for a given key will not be split across reduce tasks, but any state tracking is up to you. For example, in Java the output of your map arrives at your reducer, conceptually, in the form

    key1, {val1, val2, val3}
    key2, {val7, val8}

With streaming, your input arrives as

    key1, val1
    key1, val2
    key1, val3
    key2, val7
    key2, val8

For example, to write a reducer that computes the sum of the values for each key, you need a variable to store the last key you saw and a variable to hold the running sum. Each time you read a new key-value pair, you do the following (see the sketch after this list):

  • check whether the key is different from the last key.
  • if it is, emit the last key and the current sum, and reset the sum to zero.
  • add the current value to the sum and set the last key to the current key.
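
Here is a minimal Ruby sketch of that logic, assuming the same "|"-delimited key-value format as the wordcount example above (the delimiter is an assumption; Hadoop Streaming uses a tab character by default):

    #!/usr/bin/env ruby
    # Streaming reducer: stdin delivers "key|value" lines, sorted by key.
    last_key = nil
    sum = 0

    STDIN.each_line do |line|
      key, value = line.chomp.split("|")
      # A key change means the previous group is complete: emit it and reset.
      if last_key && key != last_key
        puts "#{last_key}|#{sum}"
        sum = 0
      end
      last_key = key
      sum += value.to_i
    end

    # Emit the final group.
    puts "#{last_key}|#{sum}" if last_key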

HTH.

+5

I have not tried Hadoop Streaming myself, but from reading the docs I think you can achieve similar parallel behavior.

Instead of handing each reducer a key with its corresponding list of values, streaming groups the map output by key. It also guarantees that values with the same key will not be split across multiple reducers. This is somewhat different from the normal Hadoop behavior, but even so, the reduce work will be distributed over several reducers.

Try using the -verbose option to get more information about what is actually going on. You can also try experimenting with -D mapred.reduce.tasks=X, where X is the desired number of reducers.
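
For instance, a streaming job wired to Ruby scripts might be launched like this (the streaming jar path and the script names mapper.rb/reducer.rb are assumptions; adjust them for your installation and Hadoop version):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=4 \
      -input /user/me/input \
      -output /user/me/output \
      -mapper mapper.rb \
      -reducer reducer.rb \
      -file mapper.rb \
      -file reducer.rb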

+1
