How does HBase distribute new regions from MapReduce across the cluster?

My situation is this: I have a 20-node Hadoop / HBase cluster with 3 ZooKeepers. I process a lot of data from HBase tables to other HBase tables through MapReduce.

Now, if I create a new table and tell a job to use it as its output table, all of its data goes to one region server. That would not surprise me if there were only a few regions. But one particular table has about 450 regions, and here is the problem: most of these regions (about 80%) are on the same region server!

I am wondering how HBase distributes the assignment of new regions across the cluster, and whether this behavior is normal / intended or a bug. Unfortunately, I don't know where to start looking for an error in my code.

I ask because this makes the jobs incredibly slow. Only once the jobs have fully completed does the table get balanced across the cluster, but that does not explain this behavior. Shouldn't HBase distribute new regions to different servers at the time they are created?

Thanks for any input!

2 answers

I think this is a known issue. HBase currently balances regions across the cluster as a whole, regardless of which table they belong to.

Refer to the HBase book for background: http://hbase.apache.org/book/regions.arch.html

You may be on an earlier version of HBase: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/19155

Also see this discussion of load balancing and moving regions: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/12549


By default, the balancer simply evens out the number of regions on each region server (RS), without taking the table into account.
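To illustrate why balancing on total region count alone can leave one table concentrated on a single server, here is a minimal, self-contained sketch. It does not use the HBase API. The class `BalanceSketch`, its methods, the table names, and the region counts are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BalanceSketch {

    /** Difference between the most and least loaded server, counting regions of all tables.
     *  Assumes a non-empty server list. */
    static int totalSpread(List<Map<String, Integer>> servers) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (Map<String, Integer> server : servers) {
            int total = 0;
            for (int n : server.values()) {
                total += n;
            }
            min = Math.min(min, total);
            max = Math.max(max, total);
        }
        return max - min;
    }

    /** Same measure, but counting only the regions of one table. */
    static int tableSpread(List<Map<String, Integer>> servers, String table) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (Map<String, Integer> server : servers) {
            int n = server.getOrDefault(table, 0);
            min = Math.min(min, n);
            max = Math.max(max, n);
        }
        return max - min;
    }

    /** Hypothetical 4-server cluster: each map is one server's table -> region count. */
    static List<Map<String, Integer>> exampleCluster() {
        List<Map<String, Integer>> servers = new ArrayList<>();
        Map<String, Integer> rs0 = new HashMap<>();
        rs0.put("new_table", 80); // every region of the new table sits on server 0
        rs0.put("old_table", 20);
        servers.add(rs0);
        for (int i = 1; i < 4; i++) {
            Map<String, Integer> rs = new HashMap<>();
            rs.put("old_table", 100);
            servers.add(rs);
        }
        return servers;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> cluster = exampleCluster();
        // Every server holds 100 regions, so a count-only view sees nothing to move.
        System.out.println("total spread: " + totalSpread(cluster));                 // prints "total spread: 0"
        // Yet all 80 regions of new_table are on one server.
        System.out.println("new_table spread: " + tableSpread(cluster, "new_table")); // prints "new_table spread: 80"
    }
}
```

In this toy cluster a balancer that looks only at totals has no reason to act, even though all writes to `new_table` hit one server.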

You can set hbase.master.loadbalance.bytable to true to have regions balanced per table instead.
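As a sketch, the property would go into hbase-site.xml on the master. This assumes your HBase version supports the setting, and the master may need a restart to pick it up:

```xml
<!-- hbase-site.xml on the HBase master (assumes a version that supports this property) -->
<property>
  <name>hbase.master.loadbalance.bytable</name>
  <value>true</value>
</property>
```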
