What is the difference between partitioning and bucketing a table in Hive?

I know that both are performed on a table column, but how is each operation different?

+116
hadoop hive
Oct 02 '13 at 2:09
9 answers

Partitioning data is often used to distribute load horizontally; it has a performance benefit and helps organize the data logically. Example: suppose we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For faster query responses, the Hive table can be PARTITIONED BY (country STRING, DEPT STRING) . Partitioning a table changes how Hive structures the data storage: Hive now creates subdirectories reflecting the partitioning structure, such as

.../employees/country=ABC/DEPT=XYZ

If a query limits the employees to country=ABC , Hive will only scan the contents of the one directory country=ABC . This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning is very useful in Hive; however, a design that creates too many partitions may optimize some queries at the expense of other important ones. Another drawback is that too many partitions means a large number of Hadoop files and directories created unnecessarily, and overhead for the NameNode, since it must keep all of the file system metadata in memory.
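As a minimal sketch (the table and column names here are illustrative, not from the original), such a partitioned table and a pruned query could look like:

```sql
-- Illustrative employees table partitioned by country and department
CREATE TABLE employees (
  id   INT,
  name STRING
)
PARTITIONED BY (country STRING, dept STRING);

-- Partition pruning: only the country=ABC subdirectories are scanned
SELECT name FROM employees WHERE country = 'ABC';
```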

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed by a user-defined number into buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of employee_id values is much greater than the number of buckets, each bucket will contain many employee_id values. When creating the table, you specify CLUSTERED BY (employee_id) INTO XX BUCKETS; where XX is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id , Hive can create a logically correct sampling. Bucketing also helps in doing efficient map-side joins, etc.
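A minimal sketch of such a bucketed table (names and bucket count assumed; on Hive versions before 2.0, bucketing also had to be enforced at load time):

```sql
-- Illustrative employee table bucketed on employee_id
CREATE TABLE emp_bucketed (
  employee_id INT,
  name        STRING
)
CLUSTERED BY (employee_id) INTO 32 BUCKETS;

-- Pre-2.0 Hive: make inserts respect the bucket definition
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE emp_bucketed
SELECT employee_id, name FROM emp_staging;  -- emp_staging is assumed
```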

+229
Oct 02 '13 at 6:37

Some details are missing from the previous explanations. To better understand how partitioning and bucketing work, you should look at how the data is stored in Hive. Let's say you have a table

 CREATE TABLE mytable ( name string, city string, employee_id int ) PARTITIONED BY (year STRING, month STRING, day STRING) CLUSTERED BY (employee_id) INTO 256 BUCKETS 

then Hive will store data in a directory hierarchy like

 /user/hive/warehouse/mytable/year=2015/month=12/day=02 

So, you have to be careful when partitioning, because if, for example, you partition by employee_id and you have millions of employees, you will end up with millions of directories in your file system. The term cardinality refers to the number of possible values a field can have. For example, if you have a "country" field, there are about 300 countries in the world, so the cardinality would be ~300. For a field like timestamp_ms , which changes every millisecond, the cardinality can be billions. In general, a field chosen for partitioning should not have high cardinality, because you will end up with too many directories in your file system.

Bucketing (also known as clustering), on the other hand, results in a fixed number of files, since you specify the number of buckets. What Hive does is take the field, calculate its hash, and assign a record to that bucket. But what happens if you use, say, 256 buckets and the field you are bucketing on has low cardinality (for instance, it is a US state, so there can be only 50 different values)? You will have 50 buckets with data and 206 buckets with no data.

Someone already mentioned how partitions can dramatically cut the amount of data being queried. So, in my example table, if you want to query only from a certain date forward, the partitioning by year/month/day will dramatically cut the amount of I/O. I think somebody also mentioned how bucketing can speed up joins with other tables that have exactly the same bucketing, so in my example, if you join two tables on the same employee_id, Hive can do the join bucket by bucket (even better if they are already sorted by employee_id, since it is then merging parts that are already sorted, which works in linear time, i.e. O(n)).
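As a sketch of the bucketed join described above (table names are assumed; the settings are classic Hive options, though defaults vary by version):

```sql
-- Bucket map join: both tables bucketed on the join key, with the
-- bucket count of one table a multiple of the other's
SET hive.optimize.bucketmapjoin = true;
-- Sort-merge bucket join, when both tables are also sorted by the key
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT a.name, b.salary
FROM emp_a a JOIN emp_b b ON a.employee_id = b.employee_id;
```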

Thus, bucketing works well when the field has high cardinality and the data is evenly distributed among the buckets. Partitioning works best when the cardinality of the partitioning field is not too high.

In addition, you can partition on multiple fields, in an order (year/month/day is a good example), while you can bucket on only one field.

+118
Dec 06 '15 at 23:42

I think I'm late in answering this question, but it keeps coming up in my feed.

Navneet provided an excellent answer. Adding to it visually.

Partitioning helps in eliminating data when used in a WHERE clause, whereas bucketing helps in organizing the data within each partition into multiple files, so that the same set of data is always written to the same bucket. It helps a lot in joining on columns.

Suppose you have a table with five columns: name, server_date, some_col3, some_col4 and some_col5. Suppose you have partitioned the table on server_date and bucketed it on the name column into 10 buckets; your file structure will look something like this.

  • server_date=xyz
    • 00000_0
    • 00001_0
    • 00002_0
    • ........
    • 00009_0

Here server_date=xyz is the partition, and the 000 files are the buckets within each partition. The buckets are computed based on some hash function, so rows with name=Sandy will always go into the same bucket.

+18
Sep 14 '15 at 12:19

Hive Partitioning:

Partitioning divides a large amount of data into multiple slices based on the value(s) of a table column (or columns).

Suppose you store information about people across the entire world, spread over 500 crores of entries. If you want to query people from a particular country (Vatican City), without partitioning you would have to scan all 500 crores of entries even to fetch the few thousand records for that country. If you partition the table based on country, you can fine-tune the querying process by just checking the data for the single relevant country partition. A Hive partition creates a separate directory for each value of the column(s).

Pros:

  • Distributes execution load horizontally
  • Faster query execution on partitions with a small data volume. For example, getting the population of "Vatican City" returns very fast instead of searching the entire population of the world.

Cons:

  • Possibility of too many small partitions, i.e. too many directories.
  • Effective for low-volume data in a given partition. But some queries, such as a GROUP BY over a high volume of data, still take a long time to execute. For example, grouping the population of China will take a long time compared to grouping the population of Vatican City. Partitioning does not solve the responsiveness problem when data is skewed towards a particular partition value.
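A hedged sketch of the country-partitioned table discussed above (table and column names are illustrative; the dynamic-partition options are standard Hive settings):

```sql
-- Illustrative people table partitioned by country
CREATE TABLE people (
  name STRING,
  id   BIGINT
)
PARTITIONED BY (country STRING);

-- Dynamic partitioning: Hive creates one directory per country value
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE people PARTITION (country)
SELECT name, id, country FROM people_staging;  -- staging table assumed

-- Scans only the country='Vatican City' directory
SELECT count(*) FROM people WHERE country = 'Vatican City';
```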

Hive Bucketing:

Bucketing decomposes data into more manageable or equal parts.

With partitioning, there is a possibility of creating many small partitions based on column values. With bucketing, you restrict the number of buckets in which to store the data; this number is defined in the table creation script.

Pros

  • Due to equal volumes of data in each bucket, map-side joins will be faster.
  • Faster query response, as with partitioning.

Cons

  • You can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be done manually by programmers.
+14
Nov 11 '15 at 7:01

The difference: bucketing divides the files by column name, while partitioning divides the files by a particular value inside the table.

I hope I have defined it correctly.

+6
Jul 12 '16 at 14:57

Before moving on to Bucketing , we need to understand what Partitioning is. Let's take the table below as an example. Note that I have given only 12 records in the example below for beginner-level understanding. In real-time scenarios, you may have millions of records.

[image: sample sales_table records]



PARTITIONING
---------------------
Partitioning is used to gain performance while querying the data. For example, in the table above, if we write the SQL below, it needs to scan all the records in the table, which reduces performance and increases overhead.

 select * from sales_table where product_id='P1' 

In order to avoid a full table scan and read only the records related to product_id='P1' , we can partition (split the Hive table's files) into multiple files based on the product_id column. This splits the Hive table's file into two files, one with product_id='P1' and the other with product_id='P2' . Now, when we execute the above query, it will scan only the product_id='P1' file.

 ../hive/warehouse/sales_table/product_id=P1 ../hive/warehouse/sales_table/product_id=P2 

The syntax for creating the partition is given below. Note that the product_id column definition should not appear along with the non-partitioned columns in the syntax below; it should be present only in partitioned by .

 create table sales_table(sales_id int,trans_date date, amount int) partitioned by (product_id varchar(10)) 
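Continuing the example, a load into one partition (the staging table name is assumed) creates exactly the per-product directory shown above:

```sql
-- Static partition insert: writes under .../sales_table/product_id=P1
INSERT OVERWRITE TABLE sales_table PARTITION (product_id = 'P1')
SELECT sales_id, trans_date, amount
FROM sales_staging            -- assumed staging table
WHERE product_id = 'P1';
```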

Cons : we have to be very careful while partitioning. That is, it should not be used for columns in which the number of duplicate values is very low (especially primary key columns), as it increases the number of partitioned files and increases the overhead for the NameNode .



BUCKETING
------------------
Bucketing is used to overcome the cons that I mentioned in the partitioning section. It should be used when there are very few duplicate values in a column (for example, a primary key column). This is similar to the concept of an index on a primary key column in an RDBMS. In our table, we can take the Sales_Id column for bucketing. It will be useful when we need to query the sales_id column.

The following is the syntax for grouping.

 create table sales_table(sales_id int,trans_date date, amount int) partitioned by (product_id varchar(10)) Clustered by(Sales_Id) into 3 buckets 

Here we further divide the data into several files on top of the partitions.

[image: bucket files under each product_id partition]

Since we specified 3 buckets, the data is split into 3 files for each product_id . Hive internally uses the modulo operator to determine in which bucket each sales_id should be stored. For example, for product_id='P1' , sales_id=1 will be stored in file 000001_0 (i.e. 1 % 3 = 1), sales_id=2 will be stored in file 000002_0 (i.e. 2 % 3 = 2), sales_id=3 will be stored in file 000000_0 (i.e. 3 % 3 = 0), and so on.
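One concrete way to exploit this layout (a sketch; TABLESAMPLE on a bucketed column is standard HiveQL) is to read a single bucket file rather than the whole table:

```sql
-- Reads only the first of the 3 bucket files in each partition
SELECT sales_id, amount
FROM sales_table
TABLESAMPLE (BUCKET 1 OUT OF 3 ON sales_id);
```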

+2
Apr 07 '19 at 0:20

There are great answers here. I would like to keep it brief to help memorize the difference between partitions and buckets.

You generally partition on a less unique column, and bucket on the most unique column.

As an example, consider the world population, with country, person name and biometric ID as fields. As you can guess, the country field would be the less unique column, and the biometric ID would be the most unique column. So ideally, you would need to partition the table by country and bucket it by biometric ID.
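This rule of thumb could be sketched as (all names assumed for illustration):

```sql
-- Partition on the low-cardinality column, bucket on the
-- high-cardinality one
CREATE TABLE world_population (
  person_name  STRING,
  biometric_id STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (biometric_id) INTO 64 BUCKETS;
```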

0
May 30 '19 at 14:24

Using partitions in a Hive table is highly recommended, for the following main reasons:

  • Inserting into a Hive table should be faster (since it uses multiple threads to write data to the partitions)
  • Querying a Hive table should be efficient, with low latency

Example: -

Suppose that an input file (100 GB) is loaded into a temporary table and contains banking data from different regions.

Hive table without partitions

 Insert into Hive table Select * from temp-hive-table /hive-table-path/part-00000-1 (part size ~ hdfs block size) /hive-table-path/part-00000-2 .... /hive-table-path/part-00000-n 

The problem with this approach is that it will scan all the data for any query you run on this table. Response time will be high compared to the other approaches, which use partitioning and bucketing.

Hive table with partitions

 Insert into Hive table partition(country) Select * from temp-hive-table /hive-table-path/country=US/part-00000-1 (file size ~ 10 GB) /hive-table-path/country=Canada/part-00000-2 (file size ~ 20 GB) .... /hive-table-path/country=UK/part-00000-n (file size ~ 5 GB) 

Pros - data can be accessed faster when querying data for specific geographies. Cons - inserting/querying the data may be further improved by splitting the data within each partition. See the bucketing option below.

Hive table with partitions and buckets

Note: create the Hive table ..... with "CLUSTERED BY (Partition_Column) INTO 5 BUCKETS"

 Insert into Hive table partition(country) Select * from temp-hive-table /hive-table-path/country=US/part-00000-1 (file size ~ 2 GB) /hive-table-path/country=US/part-00000-2 (file size ~ 2 GB) /hive-table-path/country=US/part-00000-3 (file size ~ 2 GB) /hive-table-path/country=US/part-00000-4 (file size ~ 2 GB) /hive-table-path/country=US/part-00000-5 (file size ~ 2 GB) /hive-table-path/country=Canada/part-00000-1 (file size ~ 4 GB) /hive-table-path/country=Canada/part-00000-2 (file size ~ 4 GB) /hive-table-path/country=Canada/part-00000-3 (file size ~ 4 GB) /hive-table-path/country=Canada/part-00000-4 (file size ~ 4 GB) /hive-table-path/country=Canada/part-00000-5 (file size ~ 4 GB) .... /hive-table-path/country=UK/part-00000-1 (file size ~ 1 GB) /hive-table-path/country=UK/part-00000-2 (file size ~ 1 GB) /hive-table-path/country=UK/part-00000-3 (file size ~ 1 GB) /hive-table-path/country=UK/part-00000-4 (file size ~ 1 GB) /hive-table-path/country=UK/part-00000-5 (file size ~ 1 GB) 

Pros - faster inserts. Faster queries.

Cons - Bucketing will create more files. In some cases, you may experience a problem with many small files.

Hope this helps!

0
Jun 24 '19 at 17:51

Partitions: partitions are basically horizontal slices of data that allow you to break large amounts of data into more manageable chunks.

 CREATE TABLE customer ( id INT, name STRING, address1 STRING ) PARTITIONED BY (region STRING, country STRING); 

Multiple partition columns are supported (i.e. region/country).

For the difference between partitioning and bucketing a table in Hive, see this link.

-1
Nov 03 '15 at 12:18


