How does Hive compare to HBase?

I'm interested in finding out how the recently released Hive ( http://mirror.facebook.com/facebook/hive/hadoop-0.17/ ) compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.

+56
hbase hadoop hive
Aug 23 '08 at 12:22
7 answers

It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added):

Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when the jobs are completed, as opposed to real-time queries. As a result, it should not be compared with systems such as Oracle, where analysis is done on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. For Hive queries, response times for even the smallest jobs can be on the order of 5-10 minutes, and larger jobs may even run for hours.

Since HBase and HyperTable are all about performance (being modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of functionality and a steeper learning curve (for example, they don't have joins or SQL-like syntax).

+48
Aug 30 '08 at 22:16

From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low-latency retrieval of values by key is not required.

HBase has its own metadata repository and columnar storage layout. You can write HiveQL queries over HBase tables, allowing HBase to take advantage of Hive's grammar and parser, query planner, and query execution engine. See http://wiki.apache.org/hadoop/Hive/HBaseIntegration for more details.

+11
Jun 04 '10 at 4:38

Hive is an analytics tool. Like Pig, it was designed for ad hoc batch processing of potentially enormous amounts of data by leveraging MapReduce. Think terabytes. Imagine trying to do that in a relational database...
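To make the batch model a bit more concrete, here is a rough sketch of how a query in the spirit of `SELECT page, COUNT(*) FROM views GROUP BY page` breaks down into map, shuffle, and reduce steps. This is plain in-memory Python, not Hive or Hadoop code; the `records` data and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input: one (user, page) record per page view.
records = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
    ("bob", "/cart"),
    ("bob", "/help"),
]


def map_phase(record):
    """Emit a (key, 1) pair for each record, keyed by the GROUP BY column."""
    _user, page = record
    yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def count_by_page(rows):
    """Run the three phases in sequence over an in-memory list of rows."""
    pairs = (pair for row in rows for pair in map_phase(row))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


page_counts = count_by_page(records)
print(page_counts)
```

Every record is read once and nothing is returned until the whole pass finishes, which is why this style suits large scans better than interactive lookups.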

HBase is a column-oriented key-value store modeled on BigTable. You can't run queries per se, though you can run MapReduce jobs over HBase. Its primary use case is fetching rows by key or scanning ranges of rows. A major feature is data locality when scanning across ranges of row keys for a "family" of columns.
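The access pattern described above (fetch by key, or scan a contiguous key range for one column family) can be sketched with a toy in-memory store kept sorted by row key. This is only an illustration of the idea, not the HBase client API; the class `ToySortedStore`, its methods, and the sample keys are all hypothetical.

```python
import bisect


class ToySortedStore:
    """Toy stand-in for a store kept sorted by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Insert or update one cell; new row keys are placed in sorted order."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point lookup by exact row key; returns None if the row is absent."""
        return self._rows.get(row_key)

    def scan(self, start, stop, family=None):
        """Yield rows with start <= key < stop, optionally one column family."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            cols = self._rows[key]
            if family is not None:
                prefix = family + ":"
                cols = {c: v for c, v in cols.items() if c.startswith(prefix)}
            yield key, cols


store = ToySortedStore()
store.put("user#001", "info:name", "alice")
store.put("user#002", "info:name", "bob")
store.put("user#002", "stats:visits", 7)
store.put("video#001", "info:title", "intro")

# "$" sorts immediately after "#", so this range covers every "user#..." key.
user_rows = list(store.scan("user#", "user$", family="info"))
print(store.get("user#002"))
print(user_rows)
```

Because neighbouring keys sit next to each other, a range scan touches one contiguous slice, which is the locality benefit mentioned above. Anything that isn't expressed as a key or key range (a join, an arbitrary filter over values) has to be done by the caller.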

+8
Jun 25 '09 at 21:38

To my humble knowledge, Hive is more comparable to Pig. Hive is SQL-like, while Pig is script-based. Hive seems more sophisticated, with query optimization and execution engines, and it also requires end users to specify schema parameters (partitions, etc.). Both are intended for processing text files or sequence files.

HBase is designed for storing and retrieving key-value data ... you can scan or filter those key-value pairs (rows). You cannot run queries over the (key, value) rows.

+5
Jun 06 2018-10-06T00:

Much has changed in the most recent Hive releases, so a small update is in order: Hive and HBase are now integrated. This means Hive can be used as a query layer on top of an HBase datastore. If people are looking for alternative HBase interfaces, Pig also offers a really good way to load and store HBase data. Additionally, it looks like Cloudera Impala may offer substantially faster Hive-based queries on top of HBase. They claim queries up to 45 times faster than traditional Hive setups.

+3
Feb 05

Hive and HBase are used for different purposes.

Hive:

Pros:

  • Apache Hive is a data warehouse infrastructure built on top of Hadoop.
  • It allows you to query data stored on HDFS for analysis using HQL, a SQL-like language that is translated into a series of MapReduce jobs.
  • It runs batch processes on Hadoop.
  • It is JDBC-compliant and also integrates with existing SQL-based tools.
  • Hive supports partitions
  • It supports analytical querying of data collected over a period of time.

Cons:

  • It does not currently support update statements.
  • It must be given a predefined schema to map files and directories to columns.

HBase:

Pros:

  • Scalable distributed database supporting structured data storage for large tables
  • It provides random, real-time read/write access to your Big Data. HBase operations run in real time against its database rather than as MapReduce jobs.
  • It supports partitioning of tables, and tables are further broken down into column families.
  • Scales horizontally to massive amounts of data.
  • Provides key-based access to data when storing or retrieving. It supports adding or updating rows.
  • Support for data access rights.

Cons:

  • HBase queries are written in a custom language that you need to learn.
  • HBase is not fully ACID-compliant.
  • It cannot be used with complex access patterns (e.g. joins).
  • It is also not a complete substitute for HDFS when running large batch MapReduce jobs.

Summary:

Hive can be used for analytic queries, and HBase can be used for real-time queries. Data can even be read and written from Hive to HBase and vice versa.

+3
Feb 15 '16 at 17:21

To compare Hive with HBase, I would like to recall the definition below:

A database designed to handle transactions is not designed to handle analytics. It is not structured to do analytics well. A data warehouse, on the other hand, is structured to make analytics fast and easy.

Hive is a data warehouse infrastructure built on top of Hadoop that is suitable for long-running ETL jobs. HBase is a database designed for real-time transaction processing.

0
May 11 '15 at 8:19


