How can I be sure that data is evenly distributed across the Hadoop nodes?

If I copy data from the local system to HDFS, is it evenly distributed across the nodes?

P.S. HDFS ensures that each block is stored on three different nodes. But does this mean that all the blocks of my file will be stored on the same 3 nodes? Or does HDFS select nodes randomly for each new block?

+6
hadoop hdfs
4 answers

If your replication factor is set to 3, each block will be placed on 3 separate nodes. The number of nodes a block is stored on is controlled by the replication factor. If you need wider distribution, you can increase it by editing $HADOOP_HOME/conf/hadoop-site.xml and changing the value of dfs.replication, as shown below.
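For reference, here is roughly what that setting looks like (the property name is real; the value 3 is just the common default):

    <!-- in $HADOOP_HOME/conf/hadoop-site.xml -->
    <property>
      <name>dfs.replication</name>
      <!-- number of copies kept of each block -->
      <value>3</value>
    </property>

Note that dfs.replication only applies to files written after the change; to change the replication of files that already exist, you can run hadoop fs -setrep -w 3 /some/path (the path here is just an example).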

I believe new blocks are placed almost randomly. There is some consideration for spreading them across racks (when Hadoop has rack information). For example (I can't find the link), with a replication factor of 3 and 2 racks, 2 replicas will be placed in one rack and the third replica in the other rack. I would guess there is no preference for which node within a rack gets a block.
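Rack awareness is not automatic, by the way: it only kicks in if you tell Hadoop how to map nodes to racks, typically via a topology script. A minimal sketch, assuming Hadoop of this era (the property name is the old-style one; the script path is made up):

    <!-- in hadoop-site.xml: point Hadoop at a script that maps a host/IP to a rack id -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/rack-topology.sh</value>
    </property>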

I have not seen anything indicating a preference for storing blocks of the same file on the same nodes.

If you are looking for ways to force data to be balanced across nodes (at any replication factor), a simple option is $HADOOP_HOME/bin/start-balancer.sh, which starts a balancing process that automatically moves blocks around the cluster. This and a few other balancing options can be found in the Hadoop FAQ.
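For example (the -threshold flag is a real balancer option):

    # Move blocks around until every datanode's utilization is within
    # 5 percentage points of the cluster-wide average.
    $HADOOP_HOME/bin/start-balancer.sh -threshold 5

    # Stop the balancer early if needed.
    $HADOOP_HOME/bin/stop-balancer.sh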

Hope this helps.

+9

You can open the HDFS web interface on port 50070 of your namenode. It shows information about the datanodes, including the used space on each node.
If you do not have access to the web UI, you can look at the space used in the datanodes' HDFS data directories, as in the sketch below.
If you have data skew, you can run the rebalancer, which will resolve it gradually.
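If you prefer the command line to the web UI, a quick way to see per-datanode usage (both commands are standard in Hadoop of this vintage):

    # Per-datanode capacity and used space, the same numbers the web UI shows.
    hadoop dfsadmin -report

    # Or, on each datanode, check the local storage directory directly
    # (replace the path with your dfs.data.dir setting).
    du -sh /path/to/dfs/data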

+3

Now, with the HADOOP-385 patch, we can choose the block placement policy, so as to place all blocks of a file on the same node (and similarly for the replica nodes). Read this blog post on the topic - see the comments section.
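If I remember correctly, the policy class is selected through a namenode configuration property. Treat this sketch as hypothetical (the property name dfs.block.replicator.classname is from memory, and MyBlockPlacementPolicy is a made-up class) and verify it against your Hadoop version:

    <!-- in the namenode configuration: plug in a custom BlockPlacementPolicy -->
    <property>
      <name>dfs.block.replicator.classname</name>
      <value>org.example.MyBlockPlacementPolicy</value>
    </property>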

+2

Yes, Hadoop splits data into blocks and distributes it per block, so each block is placed independently.
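You can see the per-block placement yourself with fsck (a standard HDFS tool; the path is just an example):

    # List every block of the file and the datanodes holding each replica.
    hadoop fsck /user/me/myfile.txt -files -blocks -locations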

0
