Why does submitting a job to MapReduce take so long in general?

Typically, on a 20-node cluster, submitting a job to process 3 GB of data (200 partitions) takes about 30 seconds, while the actual execution takes about 1 minute. I want to understand what the bottleneck in the job-submission process is, and to understand the following quote:

Per-MapReduce overhead is significant: MapReduce start/end times are time-consuming

Some processes that I know about: 1. Data splitting 2. Jar file sharing.
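
For orientation, here is a minimal driver sketch (class name and path arguments are illustrative, assuming the newer org.apache.hadoop.mapreduce API) that marks where those client-side steps happen: setJarByClass is what causes the job jar to be shipped to the cluster, and the input paths registered with FileInputFormat are what the client scans to compute splits before the job is handed to the scheduler.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "submit-demo");

            // The jar containing this class is uploaded and distributed
            // to the nodes that will run the tasks ("jar file sharing").
            job.setJarByClass(SubmitDemo.class);

            // The client lists these paths and computes input splits
            // ("data splitting") before the job reaches the scheduler.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(Mapper.class);   // identity mapper
            job.setNumReduceTasks(0);           // map-only job
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Blocks until the job finishes; everything before this call
            // is part of the submission overhead the question is about.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }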

+7
3 answers

A few things to understand about HDFS and M/R that help explain this delay:

  • HDFS stores your files as chunks of data distributed across several machines called datanodes.
  • M/R runs multiple programs, called mappers, one for each chunk (block) of data. The (key, value) outputs of these mappers are then combined into the final result by reducers. (Think of summing partial results coming from several mappers.)
  • Each mapper and reducer is a full-fledged program that is spawned on those distributed machines. Spawning full-fledged programs takes time, even if they do nothing (no-op map/reduce programs); a sketch of such a no-op mapper and reducer follows this list.
  • When the amount of data to be processed becomes very big, these spawn times become an insignificant part of the total run time, and that is when Hadoop shines.
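
To make the third point concrete, here is a hedged sketch of a "do nothing" job body: a pass-through mapper and reducer (class names are made up for illustration). Even though neither does any real work, each still runs as a separate task in its own JVM that has to be spawned, which is where the fixed per-task cost comes from.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // A mapper that emits its input unchanged. Hadoop still has to start a
    // separate task (a full JVM, unless JVM reuse is configured) to run it.
    class NoOpMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value); // no transformation at all
        }
    }

    // A reducer that passes every value straight through.
    class NoOpReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v); // still pays sort/shuffle and task startup cost
            }
        }
    }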

If you need to process a file with 1000 lines of content, you are better off using a regular file reader and processor. The Hadoop machinery of spawning processes across a distributed system will not bring any benefit; it only adds the overhead of locating the datanodes that hold the relevant chunks of data, starting the processing programs on them, and tracking and collecting the results.
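
For contrast, a minimal local alternative (file name is illustrative): a plain Java program that reads and processes such a small file in a single process, with none of the coordination just described.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class LocalLineCount {
        public static void main(String[] args) throws IOException {
            // Read and process the file in one local process: no datanode
            // lookups, no task scheduling, no result collection.
            try (Stream<String> lines = Files.lines(Path.of("input.txt"))) {
                long nonEmpty = lines.filter(l -> !l.isBlank()).count();
                System.out.println("non-empty lines: " + nonEmpty);
            }
        }
    }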

Now expand this to hundreds of petabytes of data, and that overhead looks negligible compared to the time it takes to process it. The parallelization of the processing (mappers and reducers) shows its advantage here.

So, before analyzing the performance of your M/R job, you should first benchmark your cluster in order to understand this baseline overhead.

How long does a no-operation map-reduce job take on a cluster?

Use MRBench for this purpose:

  • MRBench loops a small job a number of times.
  • It checks whether small jobs on your cluster are responsive and running efficiently.
  • Its impact on the HDFS layer is very limited.

To run this program, try the following (check the correct invocation for recent versions):

hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50 

Surprisingly, one of our dev clusters took 22 seconds.

Another problem is file size.

If the files are smaller than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop usually tries to spawn one mapper per block. That means that if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers, one per file, even though the files are tiny. This is a real waste, because the per-program startup overhead is significant compared to the time actually spent processing such small files.
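
One common mitigation, offered here only as a hedged sketch and not as part of the answer above: pack many small files into each split with CombineTextInputFormat from the newer org.apache.hadoop.mapreduce API, so the framework does not spawn one mapper per tiny file (class name and split size below are illustrative).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files-demo");
            job.setJarByClass(SmallFilesJob.class);

            // Group many small files into each split instead of one mapper per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap the combined split size (here ~128 MB) so splits stay block-sized.
            FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... mapper/reducer/output settings as in any other job ...
        }
    }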

+13

As far as I know, there is no single bottleneck that causes job latency; if there were, it would have been solved a long time ago.

There are a number of steps that take time, and there are reasons why the process is slow. I will try to list them and give estimates where I can:

  • Starting the hadoop client. It runs Java, and I think about 1 second of overhead can be assumed.
  • Putting the job into the queue and letting the current scheduler run the job. I am not sure how big the overhead is, but because of the asynchronous nature of the process some latency should exist.
  • Calculating the input splits.
  • Starting and syncing the tasks. Here we face the fact that TaskTrackers poll the JobTracker, and not vice versa. I think this is done for scalability reasons. It means that when the JobTracker wants to start a task, it does not call the TaskTracker; it waits for the appropriate tracker to ping it in order to receive the task. TaskTrackers cannot ping the JobTracker too frequently, otherwise they would kill it in large clusters.
  • Running the tasks. Without JVM reuse this takes about 3 seconds, with an overhead of about 1 second per task (see the configuration sketch after this list).
  • The client polling the JobTracker for results (at least I think so) also adds some latency to learning that the job is finished.
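
Regarding the JVM reuse point above, a hedged sketch of the relevant knobs: in the classic MR1 framework (Hadoop 0.20/1.x) reuse is controlled by the mapred.job.reuse.jvm.num.tasks property; in YARN/MR2 that knob was removed, and small jobs can instead run as "uber" tasks inside the application master. The job name below is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JvmReuseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // MR1 only: let each task JVM run an unlimited number of tasks (-1)
            // instead of paying the JVM startup cost for every task.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            // MR2/YARN alternative: run a small job inside the AM as an "uber" task.
            conf.setBoolean("mapreduce.job.ubertask.enable", true);

            Job job = Job.getInstance(conf, "jvm-reuse-demo");
            // ... configure mapper/reducer/paths as usual ...
        }
    }
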
+5

I have seen a similar problem, and I can say that the solution can be broken down into the following steps:

  • When HDFS stores too many small files with a fixed chunk size, there will be performance problems in HDFS. The best approach is to remove all unnecessary and small files and try again.
  • Try the following with the datanodes and namenodes:

    • Stop all services using stop-all.sh.
    • Format the namenode.
    • Reboot the machine.
    • Start all services using start-all.sh.
    • Check the datanodes and namenodes.
  • Try installing a lower version of Hadoop (Hadoop 2.5.2), which worked in two cases, found by trial and error.

0
