HBase
HBase stores data in large files called HFiles, typically on the order of hundreds of MB up to around a GB.
When HBase wants to read, it first checks the memstore to see whether the data is in memory from a recent update or insertion. If it is not in memory, HBase finds the HFiles whose key range could contain the data you want (only one file if a compaction has just run).
An HFile contains many data blocks (HBase blocks, 64 KB by default); these blocks are small to allow fast random access. At the end of the file there is an index that references all of these blocks (with the key range of each block and the block's offset in the file).
The first time an HFile is read, the index is loaded and kept in memory for future calls, and then:
- HBase performs a binary search in the index (fast, since it is in memory) to find the block that potentially contains the key you requested.
- Once the block is located, HBase asks the file system to read that particular 64 KB block at that particular offset in the file, which results in a single disk seek to load the block you want to check.
- The loaded 64 KB HBase block is then searched for the key you need, and the key-value is returned if it exists.
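The read path above can be sketched in a few lines of Python. This is only a toy model, not HBase code: ToyHFile, read_block, and read are hypothetical names, the "blocks" are plain Python lists instead of 64 KB byte ranges on HDFS, and details such as versions, timestamps, and the block cache are ignored.

```python
from bisect import bisect_right


class ToyHFile:
    """Hypothetical in-memory stand-in for an HFile (not the real format)."""

    def __init__(self, sorted_items, keys_per_block=4):
        # Split the sorted (key, value) pairs into small fixed-size "blocks".
        self.blocks = [
            sorted_items[i:i + keys_per_block]
            for i in range(0, len(sorted_items), keys_per_block)
        ]
        # Block index: first key of each block, kept in memory after first use.
        self.index = [block[0][0] for block in self.blocks]
        self.block_reads = 0

    def read_block(self, block_no):
        # Stands in for one file system read at the block's offset.
        self.block_reads += 1
        return self.blocks[block_no]

    def get(self, key):
        # Binary search for the last block whose first key is <= key.
        block_no = bisect_right(self.index, key) - 1
        if block_no < 0:
            return None  # key sorts before every block in this file
        # Scan the single loaded block for the key.
        for k, v in self.read_block(block_no):
            if k == key:
                return v
        return None


def read(key, memstore, hfiles):
    # Recent writes are served from memory first.
    if key in memstore:
        return memstore[key]
    # Otherwise consult the candidate HFiles.
    for hfile in hfiles:
        value = hfile.get(key)
        if value is not None:
            return value
    return None
```

For example, with ten rows split into blocks of four keys, looking up one key touches exactly one block; the rest of the file is never read.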
If you use smaller HBase blocks, random access makes more efficient use of the disk, but the index grows and so does the memory it needs.
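To get a feel for that trade-off, here is a rough back-of-the-envelope calculation. It assumes one index entry per data block and ignores key sizes, compression, and multi-level indexes, so treat the numbers as orders of magnitude only; index_entries is a hypothetical helper, not an HBase API.

```python
def index_entries(hfile_bytes, block_bytes):
    # One index entry per data block; round up for a partial last block.
    return -(-hfile_bytes // block_bytes)


GB = 1024 ** 3
KB = 1024

# A 1 GB HFile with progressively smaller HBase blocks.
for block_kb in (64, 16, 8):
    entries = index_entries(GB, block_kb * KB)
    print(block_kb, "KB blocks ->", entries, "index entries")
```

Going from 64 KB to 8 KB blocks multiplies the number of index entries by eight for the same file.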
HDFS
All file system calls go through HDFS, which has its own blocks (64 MB by default in older Hadoop releases; 128 MB since Hadoop 2). In HDFS, blocks are the unit used to distribute and localize data, which means that a 1 GB file will be split into 64 MB pieces that are distributed and replicated across the cluster. These blocks are large so that batch processing time is not dominated by disk seeks, since the data within a block is contiguous.
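The 1 GB example works out as follows. This is a small sketch, assuming the 64 MB block size used above and a replication factor of 3 (the usual HDFS default, but an assumption here); hdfs_blocks is a hypothetical helper, not part of any Hadoop API.

```python
def hdfs_blocks(file_bytes, block_bytes=64 * 1024 ** 2, replication=3):
    # Number of HDFS blocks the file is split into (last one may be partial).
    blocks = -(-file_bytes // block_bytes)
    # Total block replicas stored across the cluster.
    return blocks, blocks * replication


blocks, replicas = hdfs_blocks(1024 ** 3)
print(blocks, "HDFS blocks,", replicas, "replicas")
```

With these assumptions a 1 GB file becomes 16 HDFS blocks and 48 stored replicas.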
Conclusion
HBase blocks and HDFS blocks are two different things:
- HBase blocks are the unit of indexing (as well as caching and compression) in HBase and provide fast random access.
- HDFS blocks are the unit of file system distribution and data locality.
Tuning the HDFS block size relative to your HBase settings and your needs will affect performance, but that is a more subtle matter.
Geoffrey