How does HBase partition a table across region servers?

Question

Please tell me how HBase partitions a table across region servers.

For example, let's say that my row keys are integers from 0 to 10M, and I have 10 region servers.
Does this mean that the first region server will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ... the tenth 9M - 10M?

I would like my row key to be a timestamp, but since most queries will target the latest dates, all of them would be handled by only one region server. Is this true?

Or maybe this data will be distributed differently?
Or maybe I can somehow create more regions than region servers, so that (following this example) server 1 would hold keys 0 - 0.5M and 3M - 3.5M, and my data would be spread more evenly. Is this possible?

Update

I just found that there is an option hbase.hregion.max.filesize. Do you think this will solve my problem?

parallel-processing hbase hadoop

wlk Aug 05 '10 at 0:26

2 answers

The hbase.hregion.max.filesize option, which defaults to 256 MB, sets the maximum region size. Once a region reaches this limit, it is split. This means that my data will be stored in several regions of 256 MB or less.
So, regarding:

I would like my row key to be a timestamp, but since most queries will target the latest dates, all of them would be handled by only one region server. Is this true?

This is not so, because the latest data will also be split into 256 MB regions and stored on different region servers.

wlk Aug 05 '10 at 20:04

jdcryans · Accepted Answer · 2010-08-05T16:24:25+0000

With regard to partitioning, you can read Lars' blog post.

If your row key is only a timestamp, then yes, the region with the largest keys will always be hit by new requests (since a region is served by only one region server).

Do you want to use timestamps in order to do short scans? If so, consider salting your keys (google how Mozilla did this with Socorro).
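To make the salting idea concrete, here is a minimal sketch in plain Python. It does not use any HBase client API; the bucket count, the key layout ("bucket-timestamp") and the function names are assumptions made for illustration, not how Mozilla implemented it.

```python
import hashlib

NUM_BUCKETS = 10  # assumed: roughly one salt bucket per region server (max 100 here)


def salted_key(timestamp_ms, num_buckets=NUM_BUCKETS):
    """Build a row key of the form b"<bucket>-<timestamp>".

    The bucket is derived from a hash of the timestamp, so it is
    deterministic (the same timestamp always maps to the same key)
    but consecutive timestamps land in different buckets.
    """
    ts = f"{timestamp_ms:013d}"  # zero-padded so keys sort in time order
    digest = hashlib.md5(ts.encode("ascii")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}-{ts}".encode("ascii")


def scan_ranges(start_ms, stop_ms, num_buckets=NUM_BUCKETS):
    """Return one (start_row, stop_row) pair per bucket.

    A time-range query has to scan every bucket, because rows for
    that time window are spread over all of them.
    """
    return [
        (
            f"{b:02d}-{start_ms:013d}".encode("ascii"),
            f"{b:02d}-{stop_ms:013d}".encode("ascii"),
        )
        for b in range(num_buckets)
    ]
```

The trade-off is visible in scan_ranges: writes are spread over all buckets, but a short scan over a time window turns into one short scan per bucket.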

Can your timestamp be prefixed with some ID? For example, if you only request data for specific users, then prefixing the timestamp with that user ID will give you a much better load distribution.
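A composite key of that kind could look roughly like the sketch below. The layout (user ID, a NUL separator, then an 8-byte big-endian timestamp) and the function names are assumptions for illustration. It also assumes user IDs never contain a NUL byte.

```python
import struct


def user_row_key(user_id, timestamp_ms):
    """Row key = user ID + NUL separator + 8-byte big-endian timestamp.

    Big-endian unsigned integers compare bytewise in numeric order,
    so all rows of one user are contiguous and sorted by time.
    """
    return user_id.encode("utf-8") + b"\x00" + struct.pack(">Q", timestamp_ms)


def user_scan_range(user_id, start_ms, stop_ms):
    """(start_row, stop_row) covering one user's rows in [start_ms, stop_ms)."""
    return user_row_key(user_id, start_ms), user_row_key(user_id, stop_ms)
```

Writes for different users now land in different parts of the key space, while a query for one user's recent data is still a single short scan.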

If not, use a UUID or anything else that will randomly distribute your keys.

About hbase.hregion.max.filesize

Setting the max file size on the table (which you can do with the shell) does not mean that each region will be exactly X MB (where X is the value you set). Let's say your row keys are all timestamps, which means that each new row key is larger than the previous one. It will therefore always be inserted into the region with the empty end key (the last one). At some point, one of the files will grow larger than the max file size (through compactions), and that region will be split around the middle. The lower keys will be in their own region, and the higher keys in another. But since your new row key is always larger than the previous one, you will only ever write to that new region (and so on).
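The effect can be reproduced with a toy simulation. This is not HBase code: the class and function names are made up, a row count stands in for the file size limit, and a region is split the moment it exceeds that count rather than after a compaction.

```python
import bisect
import random


class SimulatedTable:
    """Toy model of a table whose regions split in the middle when full."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region  # stand-in for the max file size
        self.start_keys = [float("-inf")]    # first region has no lower bound
        self.regions = [[]]                  # sorted row keys, one list per region
        self.last_region_writes = 0
        self.total_writes = 0

    def put(self, key):
        # Find the region whose start key is the largest one <= key.
        i = bisect.bisect_right(self.start_keys, key) - 1
        if i == len(self.regions) - 1:
            self.last_region_writes += 1
        self.total_writes += 1
        bisect.insort(self.regions[i], key)
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Lower half keeps the old start key, upper half becomes a new region.
        rows = self.regions[i]
        mid = len(rows) // 2
        self.regions[i:i + 1] = [rows[:mid], rows[mid:]]
        self.start_keys.insert(i + 1, rows[mid])


def last_region_share(keys, max_rows_per_region=100):
    """Return (fraction of writes that hit the last region, region count)."""
    table = SimulatedTable(max_rows_per_region)
    for key in keys:
        table.put(key)
    return table.last_region_writes / table.total_writes, len(table.regions)


timestamps = list(range(10_000))   # monotonically increasing row keys
shuffled = timestamps[:]
random.Random(0).shuffle(shuffled)  # same keys, written in random order

print("increasing keys:", last_region_share(timestamps))
print("random order:   ", last_region_share(shuffled))
```

With increasing keys, every single write goes to the last region no matter how many regions exist. With the same keys written in random order, only a small share of writes does.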

tl;dr: even if you have more than 1,000 regions, with this schema the region with the largest row keys will always receive the writes, which means that the region server hosting it will become a bottleneck.
