If your replication factor is set to 3, each block will be placed on 3 separate nodes. The number of nodes that host a block is controlled by the replication factor. If you need more distribution, you can increase the replication factor by editing $HADOOP_HOME/conf/hadoop-site.xml and changing the value of dfs.replication.
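To make the config change concrete, here is a minimal Python sketch that reads and updates dfs.replication in a Hadoop-style configuration document. It assumes the usual <configuration>/<property>/<name>/<value> layout; the sample XML and the helper names get_property and set_property are made up for illustration and are not part of Hadoop. It works on a string rather than touching your real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a Hadoop-style config file (not taken from a real cluster).
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the value of the named property as a string, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def set_property(xml_text, name, value):
    """Return a new XML string with the named property set to value."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_elem = prop.find("value")
            if value_elem is None:
                value_elem = ET.SubElement(prop, "value")
            value_elem.text = str(value)
            break
    else:
        # Property not present yet: append a new <property> entry.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print("current:", get_property(SAMPLE_CONFIG, "dfs.replication"))
    updated = set_property(SAMPLE_CONFIG, "dfs.replication", 5)
    print("updated:", get_property(updated, "dfs.replication"))
```

In practice you would just edit the file by hand; the sketch only shows which element holds the value.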
I believe that new blocks are placed almost randomly. There is some consideration for spreading replicas across racks (when Hadoop is given rack information). For example (I can't find the link), if you have replication set to 3 and 2 racks, 2 replicas will be in one rack and the third will be placed in the other rack. I would guess that no preference is shown for which node within a rack gets a block.
I have not seen anything indicating a preference for storing blocks of the same file on the same nodes.
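The rack example can be sketched as a toy model in Python. This is not Hadoop's placement code, only an illustration of the behavior described here: with replication 3 and at least 2 racks, two replicas go to different nodes in one rack and the third goes to another rack, with nodes chosen at random. The function name place_replicas and the rack and node names are hypothetical.

```python
import random


def place_replicas(racks, replication=3, rng=None):
    """Pick (rack, node) pairs for the replicas of one block.

    racks maps a rack name to the list of node names in that rack.
    Toy model only: it mimics the "2 in one rack, 1 in another" example
    and falls back to plain random placement in every other case.
    """
    rng = rng or random.Random()
    all_nodes = [(rack, node) for rack, nodes in racks.items() for node in nodes]
    if replication > len(all_nodes):
        raise ValueError("not enough nodes for the requested replication")

    # Racks that can hold two replicas on two different nodes.
    pair_candidates = [rack for rack, nodes in racks.items() if len(nodes) >= 2]
    if replication != 3 or len(racks) < 2 or not pair_candidates:
        return rng.sample(all_nodes, replication)

    pair_rack = rng.choice(pair_candidates)
    other_racks = [rack for rack in racks if rack != pair_rack and racks[rack]]
    if not other_racks:
        return rng.sample(all_nodes, replication)

    first, second = rng.sample(racks[pair_rack], 2)
    third_rack = rng.choice(other_racks)
    third = rng.choice(racks[third_rack])
    return [(pair_rack, first), (pair_rack, second), (third_rack, third)]


if __name__ == "__main__":
    cluster = {"rack1": ["node1", "node2", "node3"], "rack2": ["node4", "node5"]}
    for block_id in range(3):
        print(block_id, place_replicas(cluster))
```

Running it a few times shows the replicas of each block landing on different nodes, with no node favored inside a rack, which matches what I would expect from the description.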
If you are looking for ways to force balancing of data across nodes (with replication at any value), a simple option is $HADOOP_HOME/bin/start-balancer.sh, which starts the balancer process that automatically moves blocks around the cluster. This and several other balancing options can be found in the Hadoop FAQ.
Hope this helps.
Quinng