Now I have a 4-phase MapReduce job as follows:
Input -> Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> Reduce4 -> Output
I noticed that Hadoop has a ChainMapper class that can chain several mappers into one large mapper and save the disk I/O cost between the map phases. There is also a ChainReducer class, but it is not a true "chain reducer": it only supports jobs of the form:
[MAP+ / REDUCE MAP*]
I know that I can set up four MR jobs for my task and use the default (identity) mapper for the last three jobs. But that costs a lot of disk I/O, because each reducer must write its output to disk so that the next mapper can read it. Is there another built-in Hadoop feature for chaining my reducers that reduces the I/O cost?
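To make the four-job workaround concrete, here is a minimal plain-Java sketch of what I mean. It uses no Hadoop classes at all: `ChainSketch`, `Reducer`, `runJob`, and the two toy reducers are hypothetical names made up for illustration, and an in-memory list stands in for the files that each job would write out and the next job would read back.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ChainSketch {
    /** Hypothetical stand-in for a reducer: one key and all of its values in, records out. */
    interface Reducer {
        List<Map.Entry<String, Integer>> reduce(String key, List<Integer> values);
    }

    /** Builds one (key, value) record. */
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** One simulated job: identity map, then group by key (the shuffle), then reduce. */
    static List<Map.Entry<String, Integer>> runJob(
            List<Map.Entry<String, Integer>> input, Reducer reducer) {
        // Identity map: records pass through unchanged, so we group them directly.
        Map<String, List<Integer>> groups = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> rec : input) {
            groups.computeIfAbsent(rec.getKey(), k -> new ArrayList<Integer>())
                  .add(rec.getValue());
        }
        // Reduce each key group; the returned list plays the role of the
        // job output that would be written to disk between jobs.
        List<Map.Entry<String, Integer>> output = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            output.addAll(reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Runs four simulated jobs back to back, one per reducer. */
    static List<Map.Entry<String, Integer>> demo() {
        List<Map.Entry<String, Integer>> data = Arrays.asList(
                kv("a", 1), kv("b", 1), kv("a", 1), kv("c", 1), kv("b", 1), kv("a", 1));

        // Toy reducers (assumptions, not my real ones).
        Reducer sumByKey = (key, values) -> Collections.singletonList(kv(key, sum(values)));
        Reducer toTotal = (key, values) -> Collections.singletonList(kv("total", sum(values)));

        // Reduce1 .. Reduce4: each one needs its own group-by-key step.
        for (Reducer r : Arrays.asList(sumByKey, toTotal, sumByKey, sumByKey)) {
            data = runJob(data, r);
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [total=6]
    }
}
```

In the sketch every call to `runJob` hands a fully materialized list to the next call. That handoff is the part that becomes a disk write followed by a disk read when these are four real MR jobs, and it is the cost I would like to avoid.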
I am using Hadoop 1.0.4.
Denzel