Is it possible to restrict a MapReduce job from accessing remote data?

We have a specialized algorithm that we want to integrate with HDFS. The algorithm requires that data be accessed locally (all of the work would be done in the Mapper). However, we do want to take advantage of HDFS for distributing the file (providing reliability and striping). Once the computation finishes, we would use the Reducer simply to send back the answer, not to do any additional work. Avoiding the network is an explicit goal. Is there a configuration parameter that lets us restrict network access to data, so that when a MapReduce job starts, each task only accesses its local DataNode?

UPDATE: Adding a little context

We are trying to apply this to a string-matching problem. Suppose our cluster has N nodes and a file of N GB of text is stored in HDFS, split into even parts across the nodes (one part per node). Can we create a MapReduce job that launches one task on each node, where each task accesses only the part of the file stored on that same host? Or will the MapReduce framework distribute the work unevenly (for example, one task reading all N parts of the data, or 0.5N nodes trying to process the entire file)?
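To check how the blocks of such a file are actually laid out across the nodes, the standard HDFS FileSystem API can be queried. A minimal sketch (the input path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]);  // e.g. /data/corpus.txt (hypothetical path)
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per HDFS block, listing the hosts that hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d len=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
  }
}

These block locations are the same information the framework consults when placing map tasks.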

+1
2 answers

Yes, you can do that: make the job map-only by setting the number of reduce tasks to zero:

job.setNumReduceTasks(0);

With zero reducers there is no shuffle phase: each mapper's output is written straight to HDFS as its own file, and nothing is exchanged between tasks over the network. Hadoop will also try to schedule each map task on a node that holds a replica of its input split, although that placement is best-effort rather than guaranteed.
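Putting that together, a minimal map-only driver could look like the sketch below. The MatchMapper class and the "needle" pattern are placeholders for the actual string-matching algorithm, not part of any real API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LocalMatchJob {

  // Placeholder mapper: emits every line of its split that contains the pattern.
  public static class MatchMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      if (line.toString().contains("needle")) {  // "needle" is a stand-in pattern
        ctx.write(offset, line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "map-only string match");
    job.setJarByClass(LocalMatchJob.class);

    job.setMapperClass(MatchMapper.class);
    job.setNumReduceTasks(0);  // map-only: no shuffle, no reduce phase

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}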

If you want to read more about this pattern, google for: map-only MR jobs

+2

No, not completely. By default, the framework tries to run the Mappers on the nodes where their HDFS blocks are stored. However, if a node holding a local replica has no free slot, the task is scheduled on another node and the data is *streamed over the network* instead. You can influence this with the locality-delay-node-ms and locality-delay-rack-ms settings (i.e., how long the scheduler waits for a node-local and then a rack-local slot before giving up). Raising them makes remote reads less likely, but locality remains best-effort, never a guarantee. (For example, if a node dies during the job, some other node will end up processing its blocks, after at most locality-delay-node-ms + locality-delay-rack-ms of waiting.)
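For illustration, assuming a Hadoop 2.x cluster running the Fair Scheduler, these delays would be set cluster-side in yarn-site.xml roughly as below. The property names and values should be verified against your Hadoop version and scheduler; the millisecond values here are only examples:

<!-- yarn-site.xml on the ResourceManager -->
<property>
  <name>yarn.scheduler.fair.locality-delay-node-ms</name>
  <value>30000</value>  <!-- how long to wait for a node-local container -->
</property>
<property>
  <name>yarn.scheduler.fair.locality-delay-rack-ms</name>
  <value>60000</value>  <!-- how long to wait for a rack-local container -->
</property>

Note that these are scheduler settings on the cluster, not per-job parameters, so they cannot be set from the job configuration.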

+1
