Hive: clustering by more than one column

I understand that when a Hive table is clustered by one column, Hive applies a hash function to that column's value and uses the result to place the row in one of the buckets. There is one file per bucket, i.e. if there are 32 buckets, then there are 32 files in HDFS.

What does it mean to have clustering on more than one column? For example, let's say that the table has CLUSTERED BY (continent, country) INTO 32 BUCKETS.

How would a hash function be performed if there is more than one column?

How many files will be created? Is it still 32?

2 answers
  • Yes, the number of files will be 32.
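  • The hash function is applied to both columns together: the pair "continent, country" is treated as a single input, and the resulting hash determines the bucket.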
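
To make the multi-column case concrete, here is a minimal Python sketch of how two column values could be folded into one bucket index. It assumes a Java-hashCode-style scheme (each column hashed separately, then combined as 31 * h + column_hash); the helper names `to_int32`, `hash_string` and `bucket_for` are hypothetical, and the hash a given Hive version actually uses may differ.

```python
def to_int32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value, mimicking Java int overflow."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def hash_string(s: str) -> int:
    """Java-style string hash (h = 31 * h + byte) over the UTF-8 bytes.

    Assumption for illustration only; not taken from the original answer.
    """
    h = 0
    for b in s.encode("utf-8"):
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = to_int32(31 * h + signed)
    return h


def bucket_for(columns, num_buckets):
    """Combine the per-column hashes into one value, then map it to a bucket."""
    combined = 0
    for value in columns:
        combined = to_int32(31 * combined + hash_string(value))
    # Clear the sign bit so the modulo result is never negative.
    return (combined & 0x7FFFFFFF) % num_buckets


rows = [("Europe", "France"), ("Europe", "Spain"), ("Asia", "Japan")]
for continent, country in rows:
    print(continent, country, "->", bucket_for((continent, country), 32))
```

However many columns are listed in CLUSTERED BY, the result is always an index in the range 0 to 31, which is consistent with the number of files staying at 32.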


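In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There is a "0x7FFFFFFF" in there too, but that is not that important.) The hash_function depends on the type of the bucketing column. For an int it is easy: hash_int(i) == i. For example, if user_id were an int and there were 10 buckets, we would expect all user_ids that end in 0 to be in bucket 1, all user_ids that end in 1 to be in bucket 2, and so on. For other data types it is a little trickier. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex data type will be some number derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_ids in bucket 1 would probably not end in 0. In general, distributing rows based on the hash gives an even distribution across the buckets.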

ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
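
The int case described above can be checked with a small Python sketch. The function names `hash_int` and `bucket_index` are hypothetical, and the index here is zero-based, so index 0 corresponds to what the answer calls "bucket 1".

```python
def hash_int(i: int) -> int:
    """For an int column the hash is the value itself: hash_int(i) == i."""
    return i


def bucket_index(user_id: int, num_buckets: int = 10) -> int:
    """Zero-based bucket index: (hash & 0x7FFFFFFF) mod num_buckets."""
    # The mask clears the sign bit, so negative ids still map to a valid bucket.
    return (hash_int(user_id) & 0x7FFFFFFF) % num_buckets


for uid in (120, 31, 42):
    print(uid, "-> bucket index", bucket_index(uid))
```

With 10 buckets, ids ending in 0 land at index 0 and ids ending in 1 at index 1, matching the pattern in the answer.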
