Difference between partitioning and bucketing in Hive

I created two tables:

1) One with both partitioning and bucketing
2) One with bucketing only

I know the concepts of partitioning and bucketing in Hive. But I'm a bit confused, because I read that 'partitioning creates directories and bucketing creates files'. I agree with the first part, because I can see those directories in the Hive warehouse in HDFS, but for the bucketing-only table I cannot see any bucket files in HDFS, except for the data file that I loaded into the table. So where are the bucketing-only table's files? I can see files like 000000_0 under the partitioned table's directory, but that is the partitioned table — what about the bucketed one?
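
To make the confusion concrete, this is roughly the layout I mean (an illustrative sketch assuming the default warehouse location /user/hive/warehouse; the partitioned table name here is made up):

     # Partitioned table: one sub-directory per partition value
     ./hadoop fs -ls /user/hive/warehouse/employee_partitioned
     drwxr-xr-x ... /user/hive/warehouse/employee_partitioned/country=USA
     drwxr-xr-x ... /user/hive/warehouse/employee_partitioned/country=CANADA

     # Bucketing-only table after my load: just the uploaded file, no bucket files
     ./hadoop fs -ls /user/hive/warehouse/employee
     -rwxr-xr-x ... /user/hive/warehouse/employee/SampleData.txt
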
Below is my code for creating the table:

 CREATE TABLE Employee(
   ID BIGINT,
   NAME STRING,
   SALARY BIGINT,
   COUNTRY STRING
 )
 CLUSTERED BY(ID) INTO 5 BUCKETS
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE;

The command I used to load the data is as follows:

 load data local inpath '/home/cloudera/Desktop/SampleData.txt' into table employee; 

I read that buckets are created when we create the table. Please correct me if I have missed or misunderstood something. Can anyone help?

hadoop hive hdfs cloudera hortonworks-data-platform
2 answers

I created external Hive tables (that is usually my choice). You can stick with yours.

Follow these steps:

  • Create database

     CREATE DATABASE IF NOT EXISTS testdb LOCATION '/hivedb/testdb'; 
  • Create a bucketed (clustered) table

     CREATE TABLE testdb.Employee(
       ID BIGINT,
       NAME STRING,
       SALARY BIGINT,
       COUNTRY STRING
     )
     CLUSTERED BY(ID) INTO 5 BUCKETS
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     STORED AS TEXTFILE
     LOCATION '/hivedb/testdb/employee';
  • Create a regular table

     CREATE TABLE testdb.Employee_plain_table(
       ID BIGINT,
       NAME STRING,
       SALARY BIGINT,
       COUNTRY STRING
     )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     STORED AS TEXTFILE
     LOCATION '/hivedb/testdb/employee_plain_table';
  • Enforce bucketing, as recommended by @lake in the other answer

     set hive.enforce.bucketing = true; 
  • Create a data file ('data.txt'). I created a data file with 20 records.

     1,AAAAA,1000.00,USA
     2,BBBBB,2000.00,CANADA
     3,CCCCC,3000.00,MEXICO
     4,DDDDD,4000.00,BRAZIL
     5,EEEEE,5000.00,ARGENTINA
     6,DDDDD,6000.00,CHILE
     7,FFFFF,7000.00,BOLIVIA
     8,GGGGG,8000.00,VENEZUELA
     9,HHHHH,9000.00,PERU
     10,IIIII,10000.00,COLOMBIA
     11,JJJJJ,11000.00,EQUADOR
     12,KKKKK,12000.00,URUGUAY
     13,LLLLL,13000.00,PARAGUAY
     14,MMMMM,14000.00,GUYANA
     15,NNNNN,15000.00,NICARAGUA
     16,OOOOO,16000.00,PANAMA
     17,PPPPP,17000.00,COSTA RICA
     18,QQQQQ,18000.00,HAITI
     19,RRRRR,19000.00,DOMINICA
     20,SSSSS,20000.00,JAMAICA
  • Copy the data file to the HDFS location '/hivedb/testdb/employee_plain_table'

     ./hadoop fs -put ~/so/data.txt /hivedb/testdb/employee_plain_table 
  • Run a select * query on testdb.Employee_plain_table

     select * from testdb.Employee_plain_table; 

    This should display 20 entries.

  • Use the insert command

     insert overwrite table testdb.employee select * from testdb.employee_plain_table; 

    This should kick off a MapReduce job and insert the records into the bucketed table.

    This will create 5 files, since we declared 5 buckets in the employee table's DDL.

  • Verify this with the command:

     ./hadoop fs -ls /hivedb/testdb/employee
     Found 5 items
     -rwxr-xr-x 1 hduser supergroup 95 2017-10-19 11:04 /hivedb/testdb/employee/000000_0
     -rwxr-xr-x 1 hduser supergroup 81 2017-10-19 11:04 /hivedb/testdb/employee/000001_0
     -rwxr-xr-x 1 hduser supergroup 90 2017-10-19 11:05 /hivedb/testdb/employee/000002_0
     -rwxr-xr-x 1 hduser supergroup 88 2017-10-19 11:05 /hivedb/testdb/employee/000003_0
     -rwxr-xr-x 1 hduser supergroup 84 2017-10-19 11:05 /hivedb/testdb/employee/000004_0

Open each file, compare it with the original data file, and you will see how the rows were distributed.
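
To see which rows went where: with the classic bucketing hash (Hive 1.x/2.x; Hive 3 switched to a murmur hash), an integer hashes to itself, so a row lands in bucket ID % 5. A quick check (a sketch based on the listing above):

     ./hadoop fs -cat /hivedb/testdb/employee/000000_0
     # expect the rows whose ID is divisible by 5: 5, 10, 15, 20
     ./hadoop fs -cat /hivedb/testdb/employee/000001_0
     # expect the rows with ID % 5 == 1: 1, 6, 11, 16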

Hope this clarifies your question! Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

Update: you used LOAD with "local", which is only a copy operation: it copies the given input file from the source to the destination path. A LOAD from "local" is a copy, while a LOAD from "hdfs" is a move operation. No MapReduce is involved, so no bucketing took place.
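
You can verify the copy-versus-move behaviour yourself (a sketch; the HDFS source path /tmp/SampleData.txt is made up for illustration):

     -- LOAD ... LOCAL copies: the source file stays on the local filesystem
     load data local inpath '/home/cloudera/Desktop/SampleData.txt' into table employee;

     -- LOAD without LOCAL moves: the file disappears from /tmp/SampleData.txt afterwards
     load data inpath '/tmp/SampleData.txt' into table employee;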


This is because you must enforce bucketing when inserting into the bucketed table, or create the buckets yourself. When inserting data into a bucketed table, you can use the following flag.

  set hive.enforce.bucketing = true; -- (Note: Not needed in Hive 2.x onward) 

This will make Hive create the buckets. You should then see a number of files equal to your number of buckets (provided you have enough records and a reasonable distribution of values in the clustering column).
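
For example, after such an insert you can count the output files; the count should match the bucket count (a sketch assuming a default warehouse path):

     ./hadoop fs -ls /user/hive/warehouse/employee
     # expect 5 files, 000000_0 through 000004_0, one per bucket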

Update: The LOAD command does not create any buckets; it simply puts the data into HDFS. You must load the data into a plain table first and then insert it into the bucketed table using an INSERT OVERWRITE statement.
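
A minimal sketch of that two-step workflow, reusing the question's employee table (the staging table name is made up for illustration):

     -- 1) Stage the raw file in a plain (non-bucketed) table
     CREATE TABLE employee_staging(
       ID BIGINT, NAME STRING, SALARY BIGINT, COUNTRY STRING
     )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

     load data local inpath '/home/cloudera/Desktop/SampleData.txt' into table employee_staging;

     -- 2) Insert into the bucketed table; this runs MapReduce and writes one file per bucket
     set hive.enforce.bucketing = true;
     insert overwrite table employee select * from employee_staging;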

