In Hadoop, where does the framework save the output of the Map task in a normal MapReduce application?

I am trying to find out where the output of the map task is saved to disk before it can be used by the Reduce task.

Note: the version of Hadoop used is 0.20.204 with the new API.

For example, when overriding the map method in the Mapper class:

 public void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
     String line = value.toString();
     StringTokenizer tokenizer = new StringTokenizer(line);
     while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, one);
     }
     // code that starts a new Job.
 }
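
For context, the snippet references fields (word, one) that are declared on the enclosing class. A minimal sketch of that class, assuming the usual WordCount-style setup (the class name here is illustrative, not from the original post):

 import java.io.IOException;
 import java.util.StringTokenizer;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 public class TokenCounterMapper
         extends Mapper<LongWritable, Text, Text, IntWritable> {

     // Reusable Writables; map() above fills these before each write.
     private final static IntWritable one = new IntWritable(1);
     private final Text word = new Text();

     // map(LongWritable, Text, Context) as shown above.
 }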

I am interested in finding out where context.write() ends up writing the data. So far I have come across:

 FileOutputFormat.getWorkOutputPath(context); 

which gives me the following location in HDFS:

 hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0 

When I try to use it as input for another job, it causes the following error:

 org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0 

Note: the second job is started from within the Mapper, so technically the temporary folder that the Mapper task is writing its output to exists when the new job starts. Even so, it still says that the input path does not exist.

Any ideas where the temporary output is written? Or rather, where can I find the output of the Map task while a job that has both a Map and a Reduce phase is running?
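
For clarity, a hypothetical sketch of the failing approach described above (launching a second job from inside the mapper and pointing it at the map task's work directory; the job name and output path are made up for illustration):

 // Inside the running map task: submit a second job whose input is the
 // current task's temporary work directory.
 Job second = new Job(context.getConfiguration(), "second-job");
 FileInputFormat.addInputPath(second,
         FileOutputFormat.getWorkOutputPath(context));
 FileOutputFormat.setOutputPath(second, new Path("/tmp/outputs/2"));
 second.submit(); // fails with InvalidInputException: input path does not exist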

+7
3 answers

So, I figured out what is really going on.

The output of the mapper is buffered in memory; when the buffer fills to roughly 80% of its capacity, it begins spilling its contents to the local disk while continuing to accept records into the buffer.
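
For reference, both the buffer size and the spill threshold are configurable in this generation of Hadoop; a minimal sketch, assuming the 0.20.x property names:

 // Map-side sort buffer tuning (0.20.x property names).
 // io.sort.mb: size of the in-memory buffer, in megabytes.
 // io.sort.spill.percent: fill fraction (default 0.80) at which a
 // background thread starts spilling the buffer to local disk.
 Configuration conf = new Configuration();
 conf.setInt("io.sort.mb", 100);
 conf.setFloat("io.sort.spill.percent", 0.80f);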

I wanted to take the intermediate output of the mapper and use it as input for another job while the mapper was still running. It turns out that this is impossible without significantly changing how Hadoop 0.20.204 is put together. The way the system works, even after everything in the mapper has run:

 map .... {
     setup(context)
     .
     .
     cleanup(context)
 }

and cleanup() has been called, there is still nothing flushed to the temporary folder.
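
For reference, the lifecycle the pseudocode above refers to is Mapper.run() from the new (org.apache.hadoop.mapreduce) API, which in the 0.20 line looks essentially like this:

 public void run(Context context) throws IOException, InterruptedException {
     setup(context);
     // Every input record is fed to map(); the output goes into the
     // in-memory buffer described above, not straight to the temporary folder.
     while (context.nextKeyValue()) {
         map(context.getCurrentKey(), context.getCurrentValue(), context);
     }
     cleanup(context);
 }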

Only after that is all of the map output eventually merged and flushed to disk, becoming the input of the shuffle and sort phases that precede the Reducer.

From everything I have read and looked at so far, the temporary folder where the output should be is the one I guessed beforehand:

 FileOutputFormat.getWorkOutputPath(context) 

I managed to achieve what I wanted in a different way. Anyway, if any questions come up about this, let me know.
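
The answer does not say which alternative was used; one common workaround, sketched here purely as an assumption, is to have the mapper write a side copy of its records straight to HDFS so another job can read them (the class name and side path below are made up):

 public class SideWritingMapper
         extends Mapper<LongWritable, Text, Text, IntWritable> {

     private SequenceFile.Writer sideWriter;

     @Override
     protected void setup(Context context) throws IOException {
         Configuration conf = context.getConfiguration();
         FileSystem fs = FileSystem.get(conf);
         // One side file per task attempt, written directly to HDFS.
         Path sidePath = new Path("/tmp/side-output/"
                 + context.getTaskAttemptID().toString());
         sideWriter = SequenceFile.createWriter(fs, conf, sidePath,
                 Text.class, IntWritable.class);
     }

     @Override
     public void map(LongWritable key, Text value, Context context)
             throws IOException, InterruptedException {
         // ... normal map logic, plus a duplicate record to HDFS:
         // sideWriter.append(word, one);
     }

     @Override
     protected void cleanup(Context context) throws IOException {
         sideWriter.close();
     }
 }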

+3

The MapReduce framework stores intermediate output on the local disk, not in HDFS, since storing it in HDFS would cause unnecessary replication of those files.
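
If you want to see which local directories that intermediate data lands in, the relevant setting in this Hadoop generation is mapred.local.dir; a minimal sketch, assuming it runs inside a task:

 // mapred.local.dir is the (comma-separated) list of local directories
 // the TaskTracker uses for intermediate map output and spill files.
 Configuration conf = context.getConfiguration();
 String localDirs = conf.get("mapred.local.dir");
 System.out.println("Intermediate data directories: " + localDirs);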

+4

The TaskTracker starts a separate JVM process for each Map or Reduce task.

The Mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. Once the data has been transferred to the Reducer, we can no longer access these temporary files.

If you want to see your Mapper output, I suggest using IdentityReducer.
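
Since the question uses the new API, which has no separate IdentityReducer class, here are two equivalent ways to inspect map output (a sketch, assuming a standard Job object named job):

 // 1) Zero reducers: map output is written directly to the job's output
 //    path in HDFS, skipping shuffle/sort entirely.
 job.setNumReduceTasks(0);

 // 2) Or keep a reduce phase but make it a pass-through: the base
 //    Reducer class is the identity in the new API.
 job.setReducerClass(Reducer.class);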

+2
