HBase random access performance and block size in HDFS

HBase can use HDFS as its backing distributed file system. However, the default block sizes are completely different: HBase uses 64 KB as its default block size, while HDFS uses at least 64 MB as its default block size, roughly 1000 times larger than HBase's.

I understand that HBase is designed for random access, so a smaller block size helps. But when HBase reads a 64 KB block, does it still have to read an entire 64 MB block from HDFS? If so, can HBase really handle highly random access well?
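
For reference, a minimal sketch of where the two block-size settings live, assuming the HBase 2.x Java client and the Hadoop FileSystem API (the table, family, and path names are illustrative): the HBase block size is a per-column-family attribute, while the HDFS block size is a per-file value defaulting to dfs.blocksize.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BlockSizeConfig {
      public static void main(String[] args) throws Exception {
        // HBase block size: a per-column-family setting, 64 KB by default.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          admin.createTable(
              TableDescriptorBuilder.newBuilder(TableName.valueOf("demo"))
                  .setColumnFamily(
                      ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                          .setBlocksize(64 * 1024)   // HBase block: 64 KB
                          .build())
                  .build());
        }

        // HDFS block size: a per-file setting, defaulting to dfs.blocksize
        // (64 MB in older releases, 128 MB in current ones).
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(
            new Path("/tmp/demo-file"),
            true,                 // overwrite
            4096,                 // I/O buffer size
            (short) 3,            // replication
            64L * 1024 * 1024)) { // HDFS block: 64 MB
          out.writeBytes("hello");
        }
      }
    }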

Tags: hbase, hdfs

2 answers

Blocks are used for different things in HDFS and HBase. Blocks in HDFS are the unit of storage on disk. Blocks in HBase are the unit of storage in memory. Many HBase blocks fit into a single HBase file (HFile), which is in turn stored in HDFS. HBase is designed to get the most out of the HDFS file system, and it takes full advantage of the HDFS block size. Some people have even tuned their HDFS to use 20 GB block sizes to make HBase more efficient.
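
To make the size gap concrete, a quick back-of-the-envelope calculation (purely illustrative; the 20 GB case is the tuning example mentioned above):

    // Rough arithmetic: how many default-size HBase blocks fit inside one
    // HDFS block at various HDFS block sizes.
    public class BlockArithmetic {
      public static void main(String[] args) {
        long hbaseBlock = 64L * 1024;                           // 64 KB HBase block
        long[] hdfsBlocks = {64L << 20, 128L << 20, 20L << 30}; // 64 MB, 128 MB, 20 GB
        for (long hdfsBlock : hdfsBlocks) {
          System.out.printf("HDFS block of %,d bytes holds %,d HBase blocks%n",
              hdfsBlock, hdfsBlock / hbaseBlock);
        }
      }
    }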

A good place to read more about what is going on behind the scenes in HBase is: http://hbase.apache.org/book.html#regionserver.arch

If you have completely random access to a table much larger than memory, the HBase cache will not help you. However, because HBase is smart about how it stores and retrieves data, it does not need to read an entire HDFS file block to get the data needed for a request. Data is indexed by key and is efficient to retrieve. Moreover, if you have designed your keys to distribute data well across the cluster, random reads will be spread roughly evenly across the servers, so overall throughput is maximized.
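
On the last point, one common way to spread random reads across region servers is to salt the row key with a stable hash prefix. This is a generic sketch, assuming a hypothetical bucket count and key layout:

    import java.nio.charset.StandardCharsets;

    // Sketch of a "salted" row key: a stable hash prefix spreads rows (and
    // therefore random reads) across regions. Bucket count and key layout
    // are assumptions for illustration.
    public class SaltedKey {
      private static final int BUCKETS = 16; // e.g. roughly one per region server

      static byte[] saltedRowKey(String naturalKey) {
        int bucket = (naturalKey.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        String salted = String.format("%02d-%s", bucket, naturalKey);
        return salted.getBytes(StandardCharsets.UTF_8);
      }

      public static void main(String[] args) {
        System.out.println(new String(saltedRowKey("user#42"), StandardCharsets.UTF_8));
      }
    }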


HBase

HBase stores data in large files called HFiles; they are big (on the order of hundreds of MB to a few GB).

When HBase wants to read, it first checks the memstore to see whether the data is in memory from a recent update or insert. If it is not in memory, HBase finds the HFiles whose key ranges may contain the data you want (only one file if a major compaction has just run).

An HFile contains many data blocks (64 KB by default in HBase); these blocks are kept small to allow fast random access. At the end of the file there is an index that references all of these blocks (with each block's key range and its offset within the file).

When the HFile is first read, the index is loaded and kept in memory for future calls; then (a toy sketch of this lookup follows the list):

  • HBase does a binary search on the index (fast, since it is in memory) to find the block that could contain the key you requested
  • Once the block is located, HBase asks the file system to read that particular 64 KB block at that particular offset in the file, costing roughly one disk seek to load the block of data to examine
  • The loaded 64 KB HBase block is then scanned for the key you need, and its value is returned if it exists
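
Here is a toy model of that lookup (not HBase's actual code): an in-memory index of (first key, offset, size) entries is binary-searched, and only the one matching block would be read from disk.

    import java.util.Arrays;

    // Toy model of the HFile lookup described above (not HBase's real code):
    // an in-memory index maps each block's first key to its offset and size;
    // a binary search picks the single block worth reading from disk.
    public class HFileIndexLookup {
      static class IndexEntry {
        final String firstKey; final long offset; final int size;
        IndexEntry(String firstKey, long offset, int size) {
          this.firstKey = firstKey; this.offset = offset; this.size = size;
        }
      }

      // Returns the index of the block whose key range may contain 'key'
      // (the last entry whose first key is <= the requested key).
      static int findBlock(IndexEntry[] index, String key) {
        int lo = 0, hi = index.length - 1, ans = 0;
        while (lo <= hi) {
          int mid = (lo + hi) >>> 1;
          if (index[mid].firstKey.compareTo(key) <= 0) { ans = mid; lo = mid + 1; }
          else { hi = mid - 1; }
        }
        return ans;
      }

      public static void main(String[] args) {
        IndexEntry[] index = {
            new IndexEntry("row0000", 0, 64 * 1024),
            new IndexEntry("row1024", 64 * 1024, 64 * 1024),
            new IndexEntry("row2048", 128 * 1024, 64 * 1024),
        };
        IndexEntry block = index[findBlock(index, "row1500")];
        // Only this one ~64 KB block would be read (and scanned for the key).
        System.out.println("read " + block.size + " bytes at offset " + block.offset);
      }
    }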

With smaller HBase blocks you get more efficient disk usage for random access, but this increases the size of the index and therefore the memory it needs.
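
A rough illustration of that trade-off for a hypothetical 1 GB HFile; the per-entry index cost used here is an assumption, not HBase's actual index format:

    // Smaller blocks -> more index entries to keep in memory.
    public class IndexSizeTradeoff {
      public static void main(String[] args) {
        long hfileSize = 1L << 30;     // hypothetical 1 GB HFile
        int approxEntryBytes = 50;     // assumed per-entry index overhead
        for (int blockKb : new int[] {8, 64, 256}) {
          long blocks = hfileSize / (blockKb * 1024L);
          System.out.printf("%3d KB blocks -> %,7d index entries (~%,d KB of index)%n",
              blockKb, blocks, blocks * approxEntryBytes / 1024);
        }
      }
    }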

HDFS

All file system calls go through HDFS, which has its own blocks (64 MB by default). In HDFS, blocks are the unit of distribution and data locality, meaning a 1 GB file is split into 64 MB pieces that are distributed and replicated. These blocks are large so that batch processing time is not dominated by disk seeks: the data within each piece is contiguous on disk.
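
Note that an HDFS client does not have to read a whole 64 MB block to get at a small range: the Hadoop FileSystem API supports positioned reads, which is essentially how a region server pulls one HBase-sized block out of a much larger HFile. A minimal sketch (the path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: a positioned read pulls just a 64 KB range out of a file
    // stored in much larger HDFS blocks (the path is made up for illustration).
    public class PositionedRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[64 * 1024];              // one HBase-sized block
        try (FSDataInputStream in = fs.open(new Path("/hbase/data/some-hfile"))) {
          in.readFully(200L * 1024 * 1024, buf);       // read 64 KB at offset 200 MB
        }
        // Only the datanode holding the HDFS block that covers this offset is
        // contacted, and only these bytes travel over the wire.
      }
    }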

Conclusion

HBase blocks and HDFS blocks are two different things:

  • HBase blocks are the unit of indexing (as well as caching and compression) in HBase and enable fast random access.
  • HDFS blocks are the unit of distribution and data locality in the file system.

Tuning the HDFS block size relative to your HBase settings and your workload will affect performance, but that is a more subtle matter.

