Optimal block size in HDFS - can large block sizes hurt?

I understand the disadvantages of small files and small block sizes in HDFS. I am trying to understand the rationale behind the default block size of 64/128 MB. Are there any drawbacks to a very large block size (say 2 GB)? I have read that larger values cause problems, but I have not yet dug into the details.

The problems I see with block sizes that are too large (please correct me, maybe some or all of these problems do not actually exist):

  • Re-replicating a 1 GB block when a data node goes down requires transferring the entire block across the cluster. This looks like a problem when considering a single file, but with a smaller block size, say 128 MB, we would have to transfer many smaller blocks instead (which, I think, comes with a lot of its own overhead).

  • There may be problems with mappers. A large block may end up going entirely to a single mapper, reducing the number of mappers that can run in parallel. But shouldn't this be a non-issue if we use a smaller split size (see the sketch after this list)?

  • This sounded silly when it occurred to me, but I will mention it anyway. Since the namenode does not know the file size in advance, it might consider a data node ineligible because the node does not have enough free disk space for a new block (which, with a large block size, could be 1-2 GB). But maybe this is handled simply by reducing the block size for that particular block (which is probably a bad solution).
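Regarding the split-size point above: in plain MapReduce you can cap the input split size independently of the HDFS block size, so one large block can still fan out to several mappers. Here is a minimal sketch, assuming a standard MapReduce job; the class name, input path, and the 256 MB cap are placeholders I made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Placeholder input path; point this at your own data.
            FileInputFormat.addInputPath(job, new Path("/data/large-block-input"));

            // Cap each input split at 256 MB so even a 2 GB block
            // is processed by several mappers instead of just one.
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

            // ... set mapper/reducer classes and submit the job as usual.
        }
    }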

The block size probably depends on the use case. What I basically want to find out is: is there a situation / use case where large block sizes can hurt?
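One thing worth keeping in mind when weighing use cases is that the block size is a per-file property chosen at write time, so different files on the same cluster can use different block sizes. A minimal sketch using the FileSystem API; the output path, the 512 MB block size, and the replication factor are illustrative values of my own, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Block size is chosen per file at write time; it overrides the
            // cluster-wide dfs.blocksize default for this file only.
            long blockSize = 512L * 1024 * 1024;   // 512 MB, placeholder value
            short replication = 3;
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            Path out = new Path("/tmp/large-block-file");   // placeholder path
            try (FSDataOutputStream stream =
                     fs.create(out, true, bufferSize, replication, blockSize)) {
                stream.writeBytes("payload goes here");
            }
        }
    }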

Any help is appreciated. Thanks in advance.

+6
1 answer

I have done extensive performance evaluations of Hadoop on large clusters, varying the block size from 64 MB up to 2 GB. To answer the question: imagine a workload where you frequently have to process small files, say tens of megabytes each. Which block size do you think would be more efficient in that case - 64 MB or 1024 MB?

In the case of large files, then yes, larger block sizes tend to improve performance, since the overhead of starting up mappers is not negligible.
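If you want to see how the small-file scenario plays out on your own cluster, you can inspect how many blocks a file actually occupies; a file smaller than the block size fits in a single block, and blocks only consume the space that is actually written. A rough sketch; the file path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLayoutCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample-file");   // placeholder path

            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // A 10 MB file occupies a single block (and usually a single
            // split) no matter how large the configured block size is.
            System.out.printf("blockSize=%d MB, fileLen=%d MB, blocks=%d%n",
                status.getBlockSize() >> 20, status.getLen() >> 20, blocks.length);
        }
    }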

+2
