Chaining multiple reducers in a Hadoop MapReduce job

I have a four-phase MapReduce job, as follows:

Input -> Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> Reduce4 -> Output

I noticed that Hadoop has a ChainMapper class that can link several mappers into one large mapper and save the disk I/O cost between map phases. There is also a ChainReducer class, but it is not a true "chain reducer". It only supports jobs of the form:

 [MAP+ / REDUCE MAP*]

I know that I can set up four MR jobs for my task and use the default (identity) mapper for the last three jobs. But that costs a lot of disk I/O, because each reducer must write its result to disk so that the next mapper can read it. Is there another built-in Hadoop feature for chaining my reducers that reduces the I/O cost?

I am using Hadoop 1.0.4.

1 answer

I don't think you can feed the output of one reducer directly into another reducer. I would go for this:

 Input -> Map1 -> Reduce1 -> Identity mapper -> Reduce2 -> Identity mapper -> Reduce3 -> Identity mapper -> Reduce4 -> Output
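
To make the data flow concrete, here is a minimal plain-Java sketch of that pipeline, shortened to two stages. It does not use any Hadoop classes: `ReducerChainModel`, `runJob`, `kv` and `demo` are hypothetical names made up for this illustration, and the sample records are invented. Each call to `runJob` stands in for one MapReduce job (map, shuffle by key, reduce). The point it tries to show is that when a reducer emits records under a new key, the next reducer needs its own shuffle, and the identity mapper is what passes the records through unchanged into that shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// In-memory model of the data flow only; none of these names are Hadoop APIs.
class ReducerChainModel {

    // Helper that builds one key/value record.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    // Models a single job: map every record, group by key (the shuffle), then reduce each group.
    static List<Map.Entry<String, Integer>> runJob(
            List<Map.Entry<String, Integer>> input,
            Function<Map.Entry<String, Integer>, List<Map.Entry<String, Integer>>> mapper,
            BiFunction<String, List<Integer>, Map.Entry<String, Integer>> reducer) {
        // TreeMap keeps keys sorted, loosely mirroring the sorted shuffle.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> record : input) {
            for (Map.Entry<String, Integer> out : mapper.apply(record)) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            output.add(reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Input -> Map1 -> Reduce1 -> identity mapper -> Reduce2 -> Output
    static List<Map.Entry<String, Integer>> demo() {
        List<Map.Entry<String, Integer>> input = Arrays.asList(
                kv("Apple", 1), kv("apple", 2), kv("Kiwi", 4), kv("plum", 3), kv("pear", 5));

        // Job 1: Map1 lower-cases the key; Reduce1 sums per word and re-keys by word length.
        List<Map.Entry<String, Integer>> stage1 = runJob(
                input,
                r -> Collections.singletonList(kv(r.getKey().toLowerCase(Locale.ROOT), r.getValue())),
                (key, values) -> {
                    int sum = 0;
                    for (int v : values) {
                        sum += v;
                    }
                    return kv(String.valueOf(key.length()), sum);
                });

        // Job 2: the identity mapper forwards records unchanged; Reduce2 keeps the max per length.
        return runJob(
                stage1,
                r -> Collections.singletonList(r),
                (key, values) -> kv(key, Collections.max(values)));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real Hadoop pipeline, each `runJob` call would be a separate job whose output directory becomes the next job's input, which is exactly where the extra disk I/O in the question comes from.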

In the Hadoop 2.x series, you can chain mappers before the reducer using ChainMapper, and chain mappers after the reducer using ChainReducer, all within a single job.
