Number of reducers for 1 task in MapReduce

In a typical MapReduce setup (e.g. Hadoop), how many reducers are used for a single task, such as word counting? My understanding is that Google's MapReduce means only 1 reducer. Is that right?

For example, a word counter will divide the input into N chunks, and N map tasks will run, each producing a list of (word, #) pairs. My question is: once the map phase is done, will only ONE reducer instance run to compute the result, or will there be several reducers running in parallel?

+8
mapreduce hadoop
5 answers

The simple answer is that the number of reducers does not have to be 1 and yes, reducers can run in parallel. As mentioned above, the number is either user-defined or derived.

To keep things in context, I will refer to Hadoop here, so that you have an idea of how things work. If you use the streaming API in Hadoop (0.20.2), you have to explicitly define how many reducers you want to run, since by default only 1 reduce task is launched. You do this by passing the argument -D mapred.reduce.tasks=<number of reducers>. The Java API will try to derive the number of reducers you need, but you can also set it explicitly. In both cases, there is a hard cap on the number of reducers that can run per node, and it is set in the mapred-site.xml configuration file using mapred.tasktracker.reduce.tasks.maximum.

On a more conceptual note, you can look at this article on the Hadoop wiki that talks about choosing the number of map and reduce tasks.

+13

It depends entirely on the situation. In some cases, you don't have any reducers at all: everything can be done map-side. In other cases, you cannot avoid using a single reducer, but this usually happens in a 2nd or 3rd map/reduce job that condenses earlier results. As a rule, however, you want to have many reducers, otherwise you lose much of the power of MapReduce! For example, in word counting, the output of your mappers will be (word, count) pairs. These pairs are then partitioned by word so that each reducer receives all pairs for the same word and can give you the final sum. Each reducer then writes its own output. If you wanted to, you could launch another M/R job that takes all of these files and merges them; that job would have only one reducer.
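The partitioning step described above can be sketched in plain Java, without any Hadoop dependency. This is only a simulation under stated assumptions: the class name `ParallelReducersSketch` and its methods are made up for illustration, and the keys are plain `String` values rather than Hadoop `Text` objects. The partition formula mirrors the one used by Hadoop's default `HashPartitioner`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ParallelReducersSketch {

    // Same formula as Hadoop's default HashPartitioner, applied here to a
    // plain String key (an assumption; real jobs would use Text keys).
    static int partitionFor(String word, int numReducers) {
        return (word.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Simulates map -> partition -> reduce for word count.
    // Returns one word->count map per reducer, i.e. one "output file" each.
    static List<Map<String, Integer>> wordCount(List<String> splits, int numReducers) {
        List<Map<String, Integer>> reducerOutputs = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            reducerOutputs.add(new TreeMap<>());
        }
        for (String split : splits) {                  // each split stands for one map task
            for (String word : split.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;                          // skip empty input splits
                }
                // The map step emits (word, 1); the pair is routed to exactly one reducer.
                int r = partitionFor(word, numReducers);
                // That reducer sums the counts for the words it owns.
                reducerOutputs.get(r).merge(word, 1, Integer::sum);
            }
        }
        return reducerOutputs;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> out = wordCount(Arrays.asList("a b a", "b c a"), 2);
        for (int r = 0; r < out.size(); r++) {
            System.out.println("reducer " + r + " -> " + out.get(r));
        }
    }
}
```

Because every occurrence of a word hashes to the same partition, no two reducers ever hold counts for the same word, which is why they can run independently and in parallel.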

+1

In the case of a simple wordcount example, it would make sense to use only one reducer.
If you want to get only one number as the result of the computation, you have to use one reducer (2 or more reducers will give you 2 or more output files).

If this reducer takes a long time, you can think about chaining several reduce phases, where the reducers in the next phase summarize the results of the previous reducers.
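As a rough illustration of that chaining idea, the sketch below merges the partial outputs of several first-phase reducers in a single final step. It is plain Java rather than a Hadoop job, and the names `ChainedReduceSketch`, `mergeOutputs` and `total` are hypothetical, chosen only for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ChainedReduceSketch {

    // Second phase with a single reducer: combines the per-reducer
    // outputs of the first phase into one result.
    static Map<String, Integer> mergeOutputs(List<Map<String, Integer>> firstPhaseOutputs) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> part : firstPhaseOutputs) {
            for (Map.Entry<String, Integer> e : part.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    // When only one number is wanted, the final reducer collapses everything to it.
    static int total(Map<String, Integer> merged) {
        int sum = 0;
        for (int count : merged.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> part0 = new HashMap<>();
        part0.put("b", 2);
        Map<String, Integer> part1 = new HashMap<>();
        part1.put("a", 3);
        part1.put("c", 1);

        Map<String, Integer> merged = mergeOutputs(Arrays.asList(part0, part1));
        System.out.println(merged + " total=" + total(merged));
    }
}
```

The first phase does the heavy lifting in parallel, so the single reducer at the end only has to process the much smaller partial results.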

+1

Reducers run in parallel. You set the number of reducers for your job by changing the mapred-site.xml configuration file, by passing it on the job execution command, or by setting it in the program, and that many reducers will then run in parallel. It does not need to be kept at 1. By default, its value is 1.

0

The default value is 1. If you are using Hive or Pig, then the number of reducers depends on the query, for example on group by, sum .....

In the case of your own MapReduce code, it can be defined using setNumReduceTasks on the job / conf.

 job.setNumReduceTasks(3); 

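In most cases, this is done when you override getPartition(), i.e. when you use a custom partitioner:

 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Partitioner;

 class customPartitioner extends Partitioner<Text, Text> {
     @Override
     public int getPartition(Text key, Text value, int numReduceTasks) {
         // Guard for jobs configured without reducers.
         if (numReduceTasks == 0) return 0;
         // "some logic" is a placeholder for your own routing conditions.
         // Every returned value must be smaller than numReduceTasks,
         // so this example needs at least 3 reducers.
         if (some logic) return 0;
         if (some logic) return 1;
         else return 2;
     }
 }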
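To make the placeholder conditions more concrete, here is one possible routing rule written as plain Java so it can be run without Hadoop. Everything in it is an assumption for illustration: the class name `FirstLetterPartitionSketch`, the choice of routing by first letter, and the use of `String` keys instead of `Text`.

```java
public class FirstLetterPartitionSketch {

    // Hypothetical stand-in for "some logic": keys starting with a-h go to
    // partition 0, i-q to partition 1, everything else to partition 2.
    static int getPartition(String key, int numReduceTasks) {
        if (numReduceTasks <= 1 || key.isEmpty()) {
            return 0;                                  // nothing to choose between
        }
        char first = Character.toLowerCase(key.charAt(0));
        int bucket;
        if (first >= 'a' && first <= 'h') {
            bucket = 0;
        } else if (first >= 'i' && first <= 'q') {
            bucket = 1;
        } else {
            bucket = 2;
        }
        // Clamp so the result always stays below numReduceTasks.
        return Math.min(bucket, numReduceTasks - 1);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"apple", "kiwi", "zebra"}) {
            System.out.println(word + " -> partition " + getPartition(word, 3));
        }
    }
}
```

The clamp at the end is a design choice for this sketch: it keeps the returned partition valid even when the job runs with fewer than 3 reducers.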

One thing you'll notice is that the number of reducers = the number of part files in the output.

Let me know if you have any doubts.

0