HBase and HFiles: how is a column family stored?

Question

If a table has a column family, are all of its columns for a given row key stored in the same HFile? Or can data for the same row key and column family be spread across different HFiles? I assumed everything was sorted into one place, but I read this in a book:

"Data from the same column family for a single row does not have to be stored in the same HFile." Why? Can a row be too big to fit in a single HFile?

"The only requirement is that within an HFile, the data for a row's column family is stored together." The two statements seem contradictory to me.

Note: I have read a little about this. HBase uses an LSM tree. Say I have a row key with all of its data in one HFile. If I later add new data, it is held in memory (the MemStore), and when the MemStore fills up, HBase flushes it to a new HFile. That way, qualifiers for one row can end up in two HFiles, and a get or scan on that row has to look in both files. Eventually HBase runs a major compaction, which merges the two old HFiles into a single new HFile and then removes them. After that, finding the row takes only one lookup. Am I right? I also do not understand why there are both minor and major compactions, since they seem to do the same thing.

hbase

Guille Mar 29 '14 at 14:13

3 answers

Curious · Answer 1 · 2014-03-30T00:38:39+0000

A column family is stored as a collection of HFiles. If you look at the directory structure of a table, it looks like this:

/table/region-id/column-family1/[list of HFiles]
/table/region-id/column-family2/[list of HFiles]

These HFiles are immutable and sorted. On a read, the scanner makes sure it consults all HFiles for the requested row key and column family.
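A toy model of this read path may help. The sketch below is not HBase code: the `Cell` tuple, the `flush` and `get_row` helpers, and the sample data are hypothetical names made up for illustration. It assumes each HFile is an immutable list of cells sorted by row, then qualifier, then newest timestamp first, and it shows one row's cells landing in two files after two flushes, with the read merging both.

```python
import heapq
from typing import NamedTuple

class Cell(NamedTuple):
    row: str
    qualifier: str
    timestamp: int
    value: str

def sort_key(cell):
    # Order by row, then qualifier, then newest timestamp first.
    return (cell.row, cell.qualifier, -cell.timestamp)

def flush(memstore):
    """Turn in-memory cells into an immutable, sorted 'HFile' (here: a tuple)."""
    return tuple(sorted(memstore, key=sort_key))

def get_row(hfiles, row):
    """Merge every HFile of one column family; keep the newest value per qualifier."""
    result = {}
    for cell in heapq.merge(*hfiles, key=sort_key):
        if cell.row == row and cell.qualifier not in result:
            result[cell.qualifier] = cell.value
    return result

# First flush: row1 and row2 are written together.
hfile1 = flush([Cell("row1", "name", 1, "Ann"), Cell("row2", "name", 1, "Bob")])
# Second flush: more data for row1 arrives later and lands in a new file.
hfile2 = flush([Cell("row1", "city", 2, "Oslo"), Cell("row1", "name", 3, "Anna")])

row1 = get_row([hfile1, hfile2], "row1")
print(row1)
```

Within each file the cells of `row1` sit next to each other, yet a complete view of `row1` needs both files, which is the situation the book describes.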

Data from the same column family for a single row need not be stored in the same HFile. So that statement is true.

The second statement from the book follows from the fact that the data inside an HFile is sorted, so within any one HFile the data for a given row key is stored together.

Tyro · Answer 2 · 2015-05-11T10:42:40+0000

Yes, you are right. The difference is:

Minor compactions are designed to have minimal impact on HBase performance, so there is an upper limit on the number of HFiles involved. They are relatively lightweight and happen more frequently. A major compaction is the only chance HBase has to clean up deleted records. Resolving a delete requires removing both the deleted record and the delete marker, and there is no guarantee that the record and the marker are in the same HFile.

In addition, minor compactions are triggered whenever a memstore is flushed, and they merge some of the store files. Major compactions run every 24 hours and merge all the store files into one. The 24-hour interval is adjusted by a random offset of up to 20% so that many major compactions do not start at the same time. Major compactions can also be triggered manually through the API or the shell.

There is one more difference between minor and major compactions: major compactions process delete markers, maximum versions, and so on, while minor compactions do not.
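The effect of that random offset can be sketched in a few lines. This is only an illustration of the arithmetic, not how HBase schedules compactions internally: the constants come from the figures quoted above, while the function name and the choice of a uniform, symmetric offset are assumptions.

```python
import random

BASE_PERIOD_HOURS = 24.0  # interval quoted in the answer
JITTER_FRACTION = 0.20    # "up to 20%" quoted in the answer

def next_major_compaction_delay(rng):
    """Return a delay in hours drawn from base +/- jitter (uniform: an assumption)."""
    jitter = BASE_PERIOD_HOURS * JITTER_FRACTION
    return BASE_PERIOD_HOURS + rng.uniform(-jitter, jitter)

rng = random.Random(42)  # seeded only to make the sketch repeatable
delays = [next_major_compaction_delay(rng) for _ in range(5)]
for d in delays:
    print(f"next major compaction in about {d:.1f} h")
```

Under these assumptions every delay falls between 19.2 and 28.8 hours, so stores that were last compacted at the same moment drift apart instead of all compacting together.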
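The difference can be illustrated with a toy model. Nothing here is HBase code: `Cell`, `minor_compact`, `major_compact`, and the sample data are hypothetical, and the delete marker is assumed to mask every version of a column at or below its own timestamp.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    row: str
    qualifier: str
    timestamp: int
    value: str              # unused for delete markers
    is_delete: bool = False

def sort_key(cell):
    # Row, qualifier, newest first; a delete marker sorts before a put
    # that carries the same timestamp.
    return (cell.row, cell.qualifier, -cell.timestamp, not cell.is_delete)

def minor_compact(hfiles):
    """Merge the selected files into one; keep every cell, markers included."""
    return tuple(sorted((c for f in hfiles for c in f), key=sort_key))

def major_compact(hfiles, max_versions=1):
    """Merge all files; drop markers, the cells they mask, and excess versions."""
    merged = sorted((c for f in hfiles for c in f), key=sort_key)
    out = []
    deleted_up_to = {}  # (row, qualifier) -> newest delete-marker timestamp
    kept = {}           # (row, qualifier) -> versions kept so far
    for c in merged:
        col = (c.row, c.qualifier)
        if c.is_delete:
            deleted_up_to[col] = max(deleted_up_to.get(col, -1), c.timestamp)
            continue  # the marker itself is not written out
        if c.timestamp <= deleted_up_to.get(col, -1):
            continue  # masked by a delete marker
        if kept.get(col, 0) >= max_versions:
            continue  # older than the versions we are allowed to keep
        kept[col] = kept.get(col, 0) + 1
        out.append(c)
    return tuple(out)

older = (Cell("row1", "name", 1, "Ann"), Cell("row2", "name", 1, "Bob"))
newer = (Cell("row1", "name", 2, "", True), Cell("row2", "name", 3, "Bobby"))

minor = minor_compact([older, newer])
major = major_compact([older, newer])
print(len(minor), len(major))
```

A minor compaction usually sees only a subset of the files, so it cannot know whether a cell masked by a marker lives in a file it did not read; that is why, in this model, only the compaction over all files discards markers and old versions.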

linehrr · Answer 3 · 2017-01-02T00:18:57+0000

Column families are stored in separate HFiles, so each column family has its own HFiles. This also means the row key is duplicated across those HFiles, which is why it is officially recommended to keep the number of column families as low as possible (<= 3 per table).
