Hadoop "chunks" data into blocks of configured size. The default value is 64 MB. You can see where this causes problems for your approach; Each handler can receive only part of the file. If the file is less than 64 MB (or any other value is configured), then each cartographer will receive only 1 file.
The block size is configurable; 64 MB is just the default. In MR, the input is divided into splits and each split goes to one map task, so a file smaller than 64 MB fits in a single split and gets exactly one map task (1 mapper per file).
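For the opposite case, a file larger than one block, the split size can be capped from the driver so the block size is not the only thing deciding how many mappers run. A sketch under that assumption; the input path and job name are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");
            FileInputFormat.addInputPath(job, new Path("/data/in")); // hypothetical path
            // Cap the split size below the block size: a 64 MB block then
            // yields ~4 splits, and therefore ~4 map tasks.
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
            // ... set mapper/reducer classes and the output path, then submit.
        }
    }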
In Hadoop Map/Reduce you can control the number of reduce tasks, and hence the number of output files, via the mapred.reduce.tasks property. In the driver, set it with job.setNumReduceTasks(numberOfFiles); (the method takes a single int, where numberOfFiles is a count you compute yourself).
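A minimal driver sketch of that idea, assuming the newer org.apache.hadoop.mapreduce API; the input path and job name are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("/data/in"); // hypothetical path

            // Count the input files so the job can emit one output per file.
            FileStatus[] files = FileSystem.get(conf).listStatus(input);

            Job job = Job.getInstance(conf, "one-reducer-per-file");
            // Equivalent to setting mapred.reduce.tasks in the configuration.
            job.setNumReduceTasks(files.length);
            // ... configure input/output formats and submit as usual.
        }
    }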
For input/output, the requirement is a 1:1 correspondence (1 in : 1 out); each input file should produce exactly one output file.
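One way to get that 1:1 layout (not spelled out in the answers above) is MultipleOutputs, naming each record's output after the file it came from. A sketch assuming text input read through a FileInputFormat, so the split can be cast to FileSplit; the class name is mine:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class OneToOneMapper extends Mapper<LongWritable, Text, Text, Text> {
        private MultipleOutputs<Text, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Name the output after the source file so results stay 1:1.
            String name = ((FileSplit) context.getInputSplit()).getPath().getName();
            out.write(new Text(name), value, name);
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            out.close();
        }
    }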