A few things to understand about HDFS and M/R help explain this delay:
- HDFS stores your files as blocks of data distributed across several machines called datanodes.
- M/R runs several programs, called mappers, one for each of the data blocks (splits). The (key, value) output of these mappers is then combined into a result by reducers. (Think of reducers summing up partial results coming from several mappers.)
- Each mapper and reducer is a full-fledged program that is launched on this distributed system. Launching these programs takes time even if they do nothing (no-op map/reduce programs); see the sketch after this list.
- When the size of the data being processed is very large, these startup times become insignificant compared to the processing time, and that is when Hadoop shines.
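To make that last point concrete, here is a minimal sketch of such a near no-op job using the Hadoop 2.x "mapreduce" API. The class names (NoOpJob, PassThroughMapper, PassThroughReducer) and the pass-through behaviour are assumptions made for this illustration; the point is only that every mapper and reducer instance runs as a separate task on the cluster, which is where the fixed startup cost comes from.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative near no-op Map/Reduce job: it does no real work,
    // yet each map and reduce task is still launched on the cluster.
    public class NoOpJob {

        // Mapper: passes every input line through unchanged.
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        // Reducer: emits each value it receives, doing no real work.
        public static class PassThroughReducer
                extends Reducer<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text v : values) {
                    context.write(key, v);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "no-op benchmark");
            job.setJarByClass(NoOpJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setReducerClass(PassThroughReducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }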
If you need to process a file with 1000 lines of content, you are better off using a regular file reader and processing it locally. The Hadoop machinery needed to spin up the job on a distributed system brings no benefit; it only adds the overhead of locating the datanodes that hold the relevant pieces of data, starting the processing programs on them, and tracking and collecting the results. (A plain local reader, sketched below, handles a file like that trivially.)
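As a hedged comparison point, a plain single-process reader; the file name input.txt and the line count it computes are placeholders for whatever per-line processing you actually need:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    // Local line processing with no cluster involved: open the file,
    // stream its lines, do the (placeholder) work, print the result.
    public class LocalLineProcessor {
        public static void main(String[] args) throws IOException {
            try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
                long count = lines.filter(line -> !line.isEmpty()).count();
                System.out.println("Non-empty lines: " + count);
            }
        }
    }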
Now scale this up to hundreds of petabytes of data, and that overhead looks completely negligible compared to the time it takes to process the data. The parallelization of the processing (mappers and reducers) then shows its advantage.
So, before analyzing the performance of your M/R job, you should first benchmark your cluster to better understand the overhead.
How long does it take to run a no-op Map/Reduce program on the cluster?
Use MRBench for this purpose:
- MRBench runs a small job a number of times.
- It checks whether small jobs are responsive and run efficiently on your cluster.
- Its impact on the HDFS layer is very limited.
To run this program, try the following (check the correct invocation for the latest Hadoop versions):
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
Surprisingly, on one of our dev clusters it was 22 seconds.
Another problem is file size.
If the file sizes are smaller than the HDFS block size, then Map/Reduce jobs carry significant overhead. Hadoop usually tries to spawn one mapper per block. This means that if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers for those 30 blocks even though the files are tiny. This is a real waste, since the per-task startup overhead is large compared to the time each task spends processing such a small file. (One common mitigation, sketched below, is to pack many small files into fewer input splits.)
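For completeness, and not something this answer prescribes, here is a sketch of that mitigation using Hadoop 2.x's CombineTextInputFormat, which groups many small files into fewer splits so fewer mappers are launched. The 128 MB split size, the class name SmallFilesJob, and the input/output paths are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative job that reads many small text files with the default
    // (identity) mapper and reducer, but packs them into combined splits.
    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small files example");
            job.setJarByClass(SmallFilesJob.class);

            // Pack many small files into splits of up to ~128 MB each,
            // so far fewer mappers are launched than there are files.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }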