You may already know that a combiner is a process that runs locally on each Mapper to pre-aggregate data before it is shuffled across the network to the cluster's reducers. The in-mapper combiner takes this optimization a step further: the aggregates are never even written to local disk; they accumulate in memory inside the Mapper itself.
The in-mapper combiner does this using the setup() and cleanup() methods of org.apache.hadoop.mapreduce.Mapper:
private Map<LongWritable, Text> inmemMap = null;

protected void setup(Mapper.Context context) throws IOException, InterruptedException {
    // Map is an interface, so instantiate a concrete implementation (java.util.HashMap)
    inmemMap = new HashMap<LongWritable, Text>();
}
Aggregation happens in map(): instead of emitting every record with context.write(), the Mapper folds each value into the in-memory map.
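For completeness, here is a minimal sketch of what such a map() method could look like; it is an illustration rather than part of the original snippet, and combineValues() is a hypothetical helper standing in for whatever merge logic the job actually needs:

protected void map(LongWritable key, Text value, Mapper.Context context)
        throws IOException, InterruptedException {
    Text partial = inmemMap.get(key);
    if (partial == null) {
        // Copy both Writables: Hadoop reuses the instances it passes to map()
        inmemMap.put(new LongWritable(key.get()), new Text(value));
    } else {
        // Fold the new value into the running partial aggregate for this key
        inmemMap.put(key, combineValues(partial, value));
    }
}

Once the task has consumed all of its input, the Map/Reduce framework calls cleanup(), and that is where the aggregated records are finally written out: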
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException {
    // Emit one aggregated record per key accumulated in the in-memory map
    for (LongWritable key : inmemMap.keySet()) {
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}
Note that map() never calls context.write(); instead, cleanup() calls context.write() exactly once per distinct key, so the Map/Reduce framework shuffles only the (already aggregated) records across the network to the reducers.
The price of the in-mapper combiner is memory: the in-memory map must fit in the Mapper's heap, so the pattern is only safe when the number of distinct keys processed by a single map task is bounded.
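When that bound cannot be guaranteed, a common mitigation (an addition here, not part of the original text) is to flush the partial aggregates early whenever the map grows past a fixed threshold; FLUSH_THRESHOLD and flushIfNeeded() below are hypothetical names:

private static final int FLUSH_THRESHOLD = 100_000; // hypothetical size limit

// Call this from map() after each insertion to keep memory bounded
private void flushIfNeeded(Mapper.Context context) throws IOException, InterruptedException {
    if (inmemMap.size() >= FLUSH_THRESHOLD) {
        for (Map.Entry<LongWritable, Text> entry : inmemMap.entrySet()) {
            context.write(entry.getKey(), doAggregation(entry.getValue()));
        }
        inmemMap.clear();
    }
}

Flushing early may emit more than one partial record per key, which is harmless as long as the reduce side repeats the same aggregation.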