Is it better to use HBase columns or serialize data with Avro?

I am working on a project that stores key/value information for a user using HBase. We are in the process of redesigning the HBase schema we use. Two options are under discussion:

  • Use HBase column qualifiers as key names. This would make the rows wide, but very sparse.
  • Drop all the data into one column and serialize it with Avro or Thrift.
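To make the two options concrete, here is a minimal, self-contained Java sketch. The length-prefixed encoding is a toy stand-in for Avro/Thrift, and the HBase calls themselves are omitted; only the row layouts are modeled.

```java
import java.io.*;
import java.util.*;

public class RowLayouts {
    // Option 1: wide, sparse row -- one cell per key (qualifier -> value).
    static Map<String, byte[]> wideRow(Map<String, String> kv) {
        Map<String, byte[]> row = new TreeMap<>();
        kv.forEach((k, v) -> row.put(k, v.getBytes()));
        return row;
    }

    // Option 2: all keys packed into a single cell as one serialized blob
    // (length-prefixed key/value pairs standing in for Avro/Thrift).
    static byte[] blobRow(Map<String, String> kv) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(kv.size());
        for (Map.Entry<String, String> e : kv.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        return bos.toByteArray();
    }

    // Reading the blob back requires decoding the whole record,
    // even if only one field is wanted.
    static Map<String, String> readBlob(byte[] blob) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
        Map<String, String> kv = new TreeMap<>();
        int n = in.readInt();
        for (int i = 0; i < n; i++) kv.put(in.readUTF(), in.readUTF());
        return kv;
    }
}
```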

What are the design tradeoffs between the two approaches? Is one preferred over the other? Are there any reasons not to store data using Avro or Thrift?


In general, I tend to use a separate column per key.

1) Obviously, you are imposing a dependency on Avro/Thrift on every client, which is one more thing to maintain. That dependency can also lock out some tools, such as BI tools, that expect to find plain values in the data without a deserialization step.

2) With an Avro/Thrift schema, you are pretty much forced to pull the whole record over the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in the "city" field/column qualifier, you still have to fetch "payments", "credit-card-information", and so on. That can also be a security issue.

3) Updates, where required, will be more complex with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. With Avro/Thrift, you are forced to read the record, re-encode it with the added field, and write it back. With the column layout, you simply write a new cell containing only the new column. For a single row that's no big deal, but if you do it across a billion rows it becomes a large bulk-rewrite operation.
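To illustrate point 3, here is a hypothetical sketch in plain Java, with a naive "k=v;" encoding standing in for Avro/Thrift, comparing how many bytes each layout must write back when one key is added:

```java
import java.util.*;

public class UpdateCost {
    // Naive stand-in for an Avro/Thrift encoder: "k=v;" pairs in one blob.
    static byte[] encode(Map<String, String> record) {
        StringBuilder sb = new StringBuilder();
        record.forEach((k, v) -> sb.append(k).append('=').append(v).append(';'));
        return sb.toString().getBytes();
    }

    // Wide layout: adding a key writes one new cell, nothing else.
    static int wideWriteCost(String key, String value) {
        return (key + value).getBytes().length;
    }

    // Blob layout: the whole record is re-encoded and rewritten.
    static int blobWriteCost(Map<String, String> record, String key, String value) {
        Map<String, String> updated = new HashMap<>(record);
        updated.put(key, value);
        return encode(updated).length;
    }
}
```

The gap between the two costs grows with the number of fields already packed into the record.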

4) You can configure compression in HBase, and it can beat Avro/Thrift serialization for space savings, because it compresses across a whole column family rather than just a single record.
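Point 4 can be demonstrated with plain java.util.zip as a rough stand-in for HBase's block compression, which likewise operates across many cells at once: compressing 100 similar records as one block beats compressing each record on its own, because redundancy between records is only visible to the shared pass.

```java
import java.util.zip.Deflater;

public class CompressDemo {
    // Returns the DEFLATE-compressed size of the input.
    static int deflatedSize(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length + 128];
        int n = 0;
        while (!d.finished()) n += d.deflate(buf, n, buf.length - n);
        d.end();
        return n;
    }
}
```

Usage: compress 100 structurally similar records individually and sum the sizes, then compress their concatenation once; the single block comes out far smaller, since each per-record pass pays header overhead and sees none of the cross-record repetition.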

5) BigTable implementations such as HBase are designed to work very well with very wide, sparse tables, so the performance improvement from packing everything into one cell is smaller than you might expect.


The correct answer here is a little more involved, so I'll give you the tl;dr first.

Use Avro / Thrift / Protobuf

You will need to strike a balance between how many fields to pack into a record and how many to keep as separate columns.

Usually you want to put the fields ("keys" in your original question) that are often accessed together into something like an Avro record, because, as cmonkey mentioned, you don't want the overhead of fetching extra data.

By making your row very wide, you will increase the seek time when fetching a subset of columns, because of how HFiles are stored. Again, determining what is optimal comes down to your access patterns.

I would also like to point out that by using something like Avro, you get schema evolvability. You do not need to delete the row and re-add it with a record containing a new field: Avro has backward- and forward-compatibility rules. This greatly simplifies your life, since you can read both new and old records WITHOUT rewriting your data or forcing updates to old client code.
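A tiny sketch of that idea in plain Java (the field name and default are hypothetical; real Avro carries the default in the schema itself): a reader that knows a schema default can handle records written before the field existed, while old readers simply ignore fields they don't know.

```java
import java.util.*;

public class EvolvingReader {
    // New fields must carry a default -- Avro's backward-compatibility rule.
    static final Map<String, String> DEFAULTS = Map.of("hasIphone6", "false");

    // Reads a field, falling back to the schema default when the record
    // predates the field.
    static String read(Map<String, String> record, String field) {
        return record.getOrDefault(field, DEFAULTS.get(field));
    }
}
```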

You should almost always use compression in HBase (SNAPPY is always a good choice).
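Compression is configured per column family. In the HBase shell it looks like this (the table and family names here are placeholders):

```shell
# Create a table whose column family 'd' uses Snappy block compression
create 'users', {NAME => 'd', COMPRESSION => 'SNAPPY'}

# Or enable it on an existing family
alter 'users', {NAME => 'd', COMPRESSION => 'SNAPPY'}
```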

