I understand the disadvantages of small files and small block sizes in HDFS. What I am trying to understand is the rationale behind the default block size of 64/128 MB. Are there any drawbacks to a large block size (say 2 GB; I have read that values larger than that cause problems, the details of which I have not yet dug into)?
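For context, here is roughly how I understand the block size gets chosen (a minimal sketch, assuming a Hadoop 2.x+ client; the path and the 1 GB figure are only illustrative, not from any real setup): the cluster-wide default comes from dfs.blocksize in hdfs-site.xml, but it is really a per-file parameter that a client can override at create time.

```java
// Minimal sketch: where the HDFS block size is actually decided.
// Assumptions: Hadoop 2.x+ client libraries on the classpath, fs.defaultFS
// pointing at an HDFS cluster; /tmp/big-block.dat is just an example path.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Client-side override of the default block size (normally set
        // cluster-wide via dfs.blocksize in hdfs-site.xml).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB

        FileSystem fs = FileSystem.get(conf);

        // The block size can also be chosen per file: the last argument of
        // this create() overload is the block size in bytes, so a "large
        // block size" (1 GB here) only applies to this one file.
        long oneGb = 1024L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(
                new Path("/tmp/big-block.dat"),
                true,                                      // overwrite
                conf.getInt("io.file.buffer.size", 4096),  // write buffer
                (short) 3,                                 // replication factor
                oneGb)) {                                  // per-file block size
            out.writeUTF("example payload");
        }
    }
}
```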
The problems I can see with a block size that is too large (please correct me; maybe some or all of these are not actually problems) -
There could be a problem re-replicating a 1 GB file when a datanode goes down, since at that block size the cluster has to transfer the whole file as a single block. That looks like a problem when we consider one file, but with a smaller block size (say 128 MB) we would have to transfer many smaller blocks instead (which, I think, comes with a lot of overhead of its own).
There could be a problem with mappers. With large blocks, each mapper would end up with an entire large block, reducing the number of possible mappers (and hence the parallelism). But this should not be a problem if we use a smaller split size, should it? (See the sketch after this list.)
This sounded silly when it occurred to me as a possible problem, but I thought I would throw it out there anyway. Since the namenode does not know the file size in advance, it might consider a datanode unusable because it does not have enough disk space for a new block (which, given the large block size, could be 1-2 GB). But maybe it handles this gracefully by simply reducing the block size for that particular block (which is probably a bad solution anyway).
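Regarding the mapper point above, this is the kind of thing I have in mind (a minimal sketch, assuming the new org.apache.hadoop.mapreduce API and a TextInputFormat job; SplitSizeExample and the 128 MB cap are just placeholders I made up):

```java
// Minimal sketch: capping the input split size below the HDFS block size so
// that one large block still yields several map tasks.
// Assumptions: Hadoop 2.x+ MapReduce ("new" API); identity Mapper/Reducer
// defaults are enough for this demo; args[0]/args[1] are input/output paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        job.setJarByClass(SplitSizeExample.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Cap each input split at 128 MB. Even if the underlying block size
        // is 1-2 GB, FileInputFormat then carves each block into several
        // splits, so the number of mappers is not tied to the block size.
        // (Equivalent property in Hadoop 2.x:
        //  mapreduce.input.fileinputformat.split.maxsize)
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The trade-off, as far as I understand it, is that splits smaller than a block may cause some mappers to read part of their block from a remote datanode, so the split size does not completely undo the effect of the block size.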
The right block size probably depends on the use case. What I basically want to find out is: is there a situation / use case where a large block size can hurt?
Any help is appreciated. Thanks in advance.