Is the InputFormat responsible for implementing data locality in Hadoop's MapReduce?

I'm trying to understand how data locality relates to Hadoop's Map/Reduce framework. In particular, I'm trying to figure out which component handles data locality (i.e., is it the InputFormat?).

The Yahoo Developer Network page says: "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that an HDFS input format might query the NameNode to determine which nodes contain the desired data and launch the map tasks on those nodes if possible. One could imagine a similar approach being taken with HBase, querying to determine which regions are serving certain records.

If a developer writes their own InputFormat, are they responsible for implementing data locality?

+4
2 answers

You're right. Look at the FileInputFormat class and its getSplits() method. It looks up the block locations:

BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);

This implies a query to the FileSystem. It happens inside the JobClient, and the results are written to a SequenceFile (it's actually just raw bytes). The JobTracker later reads this file while initializing the job and pretty much just assigns each task to an input split.
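As a rough sketch (simplified, not the actual Hadoop source; the class and method names are just for illustration, and it uses the newer org.apache.hadoop.mapreduce API), getSplits() boils down to something like this: ask the FileSystem for the block locations and emit roughly one split per block, tagged with the hosts that store it.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitSketch {
    // Roughly one split per HDFS block, each tagged with the hosts that
    // store that block; the scheduler uses those hosts as placement hints.
    static List<InputSplit> splitsFor(Path path, Configuration conf) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        FileStatus file = fs.getFileStatus(path);
        BlockLocation[] blkLocations =
            fs.getFileBlockLocations(file, 0, file.getLen());

        List<InputSplit> splits = new ArrayList<>();
        for (BlockLocation blk : blkLocations) {
            splits.add(new FileSplit(path, blk.getOffset(), blk.getLength(),
                                     blk.getHosts()));
        }
        return splits;
    }
}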

BUT the distribution of the data itself is the NameNode's job.

Now to your question: normally you extend FileInputFormat. You have to return a list of InputSplits, and in the initialization step each split needs to have its locations set. For example, FileSplit:

 public FileSplit(Path file, long start, long length, String[] hosts) 

So you are not implementing data locality yourself; you are just telling the framework on which hosts the split can be found. That information is easily retrieved through the FileSystem interface.
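The same idea would apply to something that isn't file-based, like the HBase case you mention: your custom InputSplit simply reports the hosts serving the records via getLocations(). Here is a minimal, hypothetical sketch (the RegionSplit class and its fields are made up for illustration; a real split would also have to be serializable, e.g. by implementing Writable, which is omitted here):

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputSplit;

public class RegionSplit extends InputSplit {
    private final String startKey;   // first record key covered by this split
    private final String endKey;     // last record key covered by this split
    private final String[] hosts;    // nodes currently serving these records

    public RegionSplit(String startKey, String endKey, String[] hosts) {
        this.startKey = startKey;
        this.endKey = endKey;
        this.hosts = hosts;
    }

    @Override
    public long getLength() throws IOException {
        return 0;   // only a size hint, used to sort splits
    }

    @Override
    public String[] getLocations() throws IOException {
        return hosts;   // a placement hint for the scheduler, not a guarantee
    }
}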

+7

My understanding is that data locality is jointly determined by HDFS and the InputFormat. The former determines (via rack awareness) and stores the location of HDFS blocks across DataNodes, while the latter determines which blocks are assigned to which split. The JobTracker tries to optimize which splits are delivered to which map task (one split maps to one map task) by making sure that the blocks associated with each split are local to the TaskTracker.
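You can inspect the HDFS side of this yourself: the block-to-host mapping that the scheduler works from is available through the same FileSystem API. A small stand-alone example (not from the question; it just prints which hosts hold which blocks of a given file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        // args[0] is an HDFS path, e.g. a part file of a previous job's output
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(new Configuration());

        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation blk : fs.getFileBlockLocations(status, 0, status.getLen())) {
            // Each block reports the DataNodes that hold a replica of it.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                blk.getOffset(), blk.getLength(),
                String.join(",", blk.getHosts()));
        }
    }
}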

Unfortunately, this way of guaranteeing locality holds up in homogeneous clusters but breaks down in heterogeneous ones, i.e., where DataNodes have hard drives of different sizes. If you want to dig deeper, you should read this paper (Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters), which also covers several topics related to your question.

0