Where should the map task put temporary files when running under Hadoop?

I am running Hadoop 0.20.1 under SLES 10 (SUSE).

My map task takes a file and generates a few more, and I then produce the results from these files. I would like to know where I should place these files so that performance is good and there are no collisions. It would also be nice if Hadoop could delete the directory automatically.

I am currently using a temporary folder and task id to create a unique folder and then work in subfolders of this folder.

    String reduceTaskId = job.get("mapred.task.id");   // unique per task attempt
    String reduceTempDir = job.get("mapred.temp.dir");
    // Build a per-task scratch folder so concurrent tasks do not collide.
    String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId + File.separator;
    // myTemporaryFoldername already ends with a separator, so don't append another.
    File diseaseParent = new File(myTemporaryFoldername + REDUCE_WORK_FOLDER);

The problem with this approach is that I am not sure it is optimal, and I also have to delete every new folder myself or I will run out of disk space.

Thanks, akintayo

(edit) I found that the best place to store files that you don't want to outlive the map task is job.get("job.local.dir"), which provides a path that will be deleted when the map tasks complete. I am not sure whether the deletion is done per task or for each tasktracker.
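
For illustration, a minimal sketch of reading job.local.dir from an old-API (0.20.x mapred) mapper; the class name and the intermediate file name are hypothetical, not part of any Hadoop API:

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ScratchDirMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private File scratchDir;

        @Override
        public void configure(JobConf job) {
            // Job-scoped scratch space on local disk; per the note above,
            // the framework removes it, so no manual cleanup is needed.
            scratchDir = new File(job.get("job.local.dir"));
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Hypothetical intermediate file created under the scratch directory.
            File intermediate = new File(scratchDir, "intermediate-" + key.get());
            // ... write the generated files here, read them back, emit results ...
        }
    }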

1 answer

The problem with that approach is that the sort and shuffle will move your data away from where it was localized.

I know little about your data, but a distributed cache may work well for you.

${mapred.local.dir}/taskTracker/archive/ : the distributed cache. This directory holds the localized distributed cache. The localized distributed cache is thus shared among all tasks and jobs.

http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

"Typically, MapReduce requires each file to read one or more files or reduce the task to completion. For example, you might have a lookup table that needs to be analyzed before processing a set of records. To eliminate this scenario, the Hadoops MapReduce implementation includes a distributed a cache file that will manage the copying of the file (s) to the task nodes.

DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a wealth of documentation for DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you have read the existing documentation and understand how to use DistributedCache, come back."
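
To make the pattern concrete, here is a minimal sketch against the old 0.20.x mapred API; the HDFS path and the class name are assumptions for illustration only:

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {

        // Driver side: register a file so each task node receives a local copy.
        // The HDFS path here is hypothetical.
        public static void registerLookup(JobConf conf) throws IOException {
            DistributedCache.addCacheFile(
                    URI.create("hdfs:///user/akintayo/lookup.dat"), conf);
        }

        // Task side (e.g. from Mapper.configure(JobConf)): locate the localized
        // copy, which lands under ${mapred.local.dir}/taskTracker/archive/.
        public static Path localLookup(JobConf job) throws IOException {
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            return cached[0];
        }
    }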
