Column or row based for HBase

Question


I am wondering if HBase uses column based storage or row based storage?

I have read some white papers, and one of the benefits they mention is that HBase uses column-based storage to keep similar data together, which facilitates compression. This would mean that the same columns of different rows are stored together.
But I also found out that HBase is a sorted map of key/value pairs. It uses a key to address all the relevant columns for that key (row), so it looks like row-based storage?

It would be great if someone could clear up my confusion.

Thanks in advance, George

+8

hbase

George2 Aug 05 '12 at 12:55


2 answers

In addition to Ian's excellent answer, I would suggest that HBase is both a row-based key/value store and a key/value store for columns (if you know the row key).

If you prefer to think about it in terms of data structure, here's what a simple HBase table looks like:

'rowkey1' => {
  'c:col1' => 'value1',
  'c:col2' => 'value2',
},
'rowkey2' => {
  'c:col1' => 'value10',
  'c:col3' => 'value3'
}
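The same structure can be modeled in Python as a dictionary of dictionaries. This is only an in-memory sketch to make both views concrete, not the HBase client API; the `table` variable and the `get` and `scan` helpers are hypothetical names made up for illustration.

```python
# In-memory model of the table above; this is NOT the HBase client API.
table = {
    "rowkey1": {"c:col1": "value1", "c:col2": "value2"},
    "rowkey2": {"c:col1": "value10", "c:col3": "value3"},
}


def get(table, row_key, column=None):
    """Return a whole row (row-based view) or a single cell (column view)."""
    row = table.get(row_key, {})
    if column is None:
        return dict(row)      # copy of every column stored for this row key
    return row.get(column)    # None if the row has no such column


def scan(table):
    """Yield (row_key, columns) pairs in sorted row-key order."""
    for row_key in sorted(table):
        yield row_key, table[row_key]
```

Note that rows do not need to share the same columns: `rowkey2` has `c:col3` but no `c:col2`, so `get(table, "rowkey2", "c:col2")` simply finds nothing.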

Of course, you can also store even more complex data structures in it, as you can see from Ian's presentation.

+2

Suman Aug 6 '12 at 20:38


Ian Varley · Accepted Answer · 2012-08-05T13:58:23+0000

George, here is the presentation I gave at HBaseCon 2012 about understanding HBase schemas:

http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-hbasecon-2012.html

In short, each row in HBase is actually a key/value map, where you can have any number of columns (keys), each of which has a value. (And technically, each of those can have several values with different timestamps.)
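To illustrate the point about timestamps, here is a minimal Python sketch of a single row whose cells keep several versions. It is a toy model under stated assumptions, not HBase code: the `put` and `get_cell` helpers and the integer timestamps are hypothetical.

```python
def put(row, column, value, timestamp):
    """Store a value for a column under the given timestamp."""
    row.setdefault(column, {})[timestamp] = value


def get_cell(row, column, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = row.get(column, {})
    newest_first = sorted(versions.items(), reverse=True)
    return newest_first[:max_versions]


# One row: column name -> {timestamp -> value}
row = {}
put(row, "c:col1", "old", 100)
put(row, "c:col1", "new", 200)
```

With this model, `get_cell(row, "c:col1")` returns only the newest version, and asking for `max_versions=2` returns both, newest first.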

In addition, "column families" allow you to place multiple key/value maps for the same row in different physical (disk) files. This helps optimize the case where you have sets of values that are usually accessed separately from other sets (so you have less data to read from disk). The trade-off, of course, is that more work is needed to read all the values in a row if you split the columns into two column families, because it takes 2x the number of disk accesses.
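The trade-off can be sketched with a toy Python model in which each column family gets its own store, standing in for a separate set of disk files. The `Table` class, its `store_reads` counter, and the family names `meta` and `body` are all hypothetical; real HBase storage is considerably more involved.

```python
class Table:
    """Toy table with one in-memory store per column family."""

    def __init__(self, families):
        # family name -> {row_key -> {qualifier -> value}}
        self.stores = {family: {} for family in families}
        self.store_reads = 0  # counts how many family stores were touched

    def put(self, row_key, family, qualifier, value):
        self.stores[family].setdefault(row_key, {})[qualifier] = value

    def get(self, row_key, families=None):
        """Read a row, optionally restricted to some column families."""
        wanted = families if families is not None else list(self.stores)
        result = {}
        for family in wanted:
            self.store_reads += 1  # one "disk access" per family read
            cells = self.stores[family].get(row_key, {})
            for qualifier, value in cells.items():
                result[f"{family}:{qualifier}"] = value
        return result


t = Table(["meta", "body"])
t.put("row1", "meta", "title", "Hello")
t.put("row1", "body", "text", "Long content")
```

Reading only the `meta` family touches one store, while reading the whole row touches both, which is the cost described above.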

Unlike in more standard "column-oriented" databases, I have never heard of anyone creating an HBase table that has a column family for each logical column. There is overhead associated with column families, and the general recommendation is to have no more than 3 or 4 of them. Column families are design-time information, that is, you must specify them when the table is created (or altered).

As a rule, I think column families are an advanced design option that you would use only once you have a deep understanding of the HBase architecture and can show that it would be a net gain.

So, in general, although it is true that HBase can act in a "column-oriented" way, that is not the default or most common design pattern in HBase. It's better to think of it as a row store with key/value maps.
