Optimal block size in HDFS - can large block sizes hurt?

I understand the disadvantages of small files and small block sizes in HDFS. I am trying to understand the rationale behind the default block size of 64/128 MB. Are there any drawbacks to a very large block size (say 2 GB)? I have read that larger values cause problems, but I have not yet dug into the details.

The problems I see with block sizes that are too large (please correct me, maybe some or all of these problems do not actually exist):

  • Re-replicating a 1 GB block when a data node goes down requires transferring the entire block across the cluster. This looks like a problem when considering a single file, but with a smaller block size, say 128 MB, we would have to transfer many smaller blocks instead (which, I think, comes with a lot of its own overhead).

  • There may be problems with mappers. A large block may end up going entirely to a single mapper, reducing the number of mappers that can run in parallel. But shouldn't this be a non-issue if we use a smaller split size (see the sketch after this list)?

  • This sounded silly when it occurred to me, but I will mention it anyway. Since the namenode does not know the file size in advance, it might consider a data node ineligible because the node does not have enough free disk space for a new block (which, with a large block size, could be 1-2 GB). But maybe this is handled simply by reducing the block size for that particular block (which is probably a bad solution).
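Regarding the split-size point above: in plain MapReduce you can cap the input split size independently of the HDFS block size, so one large block can still fan out to several mappers. Here is a minimal sketch, assuming a standard MapReduce job; the class name, input path, and the 256 MB cap are placeholders I made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Placeholder input path; point this at your own data.
            FileInputFormat.addInputPath(job, new Path("/data/large-block-input"));

            // Cap each input split at 256 MB so even a 2 GB block
            // is processed by several mappers instead of just one.
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

            // ... set mapper/reducer classes and submit the job as usual.
        }
    }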

The block size probably depends on the use case. What I basically want to find out is: is there a situation / use case where large block sizes can hurt?
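One thing worth keeping in mind when weighing use cases is that the block size is a per-file property chosen at write time, so different files on the same cluster can use different block sizes. A minimal sketch using the FileSystem API; the output path, the 512 MB block size, and the replication factor are illustrative values of my own, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Block size is chosen per file at write time; it overrides the
            // cluster-wide dfs.blocksize default for this file only.
            long blockSize = 512L * 1024 * 1024;   // 512 MB, placeholder value
            short replication = 3;
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            Path out = new Path("/tmp/large-block-file");   // placeholder path
            try (FSDataOutputStream stream =
                     fs.create(out, true, bufferSize, replication, blockSize)) {
                stream.writeBytes("payload goes here");
            }
        }
    }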

Any help is appreciated. Thanks in advance.

+6
1 answer

I have done extensive performance evaluations of Hadoop on large clusters, varying the block size from 64 MB up to 2 GB. To answer the question: imagine a workload where you frequently have to process small files, say tens of megabytes each. Which block size do you think would be more efficient in that case - 64 MB or 1024 MB?

In the case of large files, then yes, larger block sizes tend to improve performance, since the overhead of starting up mappers is not negligible.
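If you want to see how the small-file scenario plays out on your own cluster, you can inspect how many blocks a file actually occupies; a file smaller than the block size fits in a single block, and blocks only consume the space that is actually written. A rough sketch; the file path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLayoutCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample-file");   // placeholder path

            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // A 10 MB file occupies a single block (and usually a single
            // split) no matter how large the configured block size is.
            System.out.printf("blockSize=%d MB, fileLen=%d MB, blocks=%d%n",
                status.getBlockSize() >> 20, status.getLen() >> 20, blocks.length);
        }
    }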

+2
