In Netezza, when you join tables that are distributed on the same key and use those key columns in the join condition, each SPU (machine) works 100% independently of the others (see nz-interview).
In Hive, such a join becomes a map join, but the placement of the files that represent a table's buckets on the datanodes is handled by HDFS, and HDFS does not place them according to Hive's CLUSTERED BY key!
So suppose I have two tables CLUSTERED BY the same key and I join on that key. Can Hive get a guarantee from HDFS that the corresponding buckets will sit on the same node? Or will it always have to copy the matching bucket of the small table to the datanode that holds the large table's bucket?
Thanks, Ido
(Note: my question is better formulated as: how does Hive/Hadoop guarantee that each mapper works on data that is local to it?)
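For context, here is a minimal sketch of the setup the question assumes (table and column names are made up). Hive's bucket map join optimization matches buckets by number, not by physical location: each mapper reads a bucket of the big table and fetches the corresponding bucket of the small table over the network, rather than relying on HDFS to co-locate the two.

```sql
-- Hypothetical tables, bucketed on the join key with matching bucket counts.
CREATE TABLE big_t   (k INT, v STRING) CLUSTERED BY (k) INTO 32 BUCKETS;
CREATE TABLE small_t (k INT, w STRING) CLUSTERED BY (k) INTO 32 BUCKETS;

-- Enable the bucket map join optimization. Hive then loads only the
-- matching bucket of small_t into each mapper that reads a bucket of
-- big_t; it does NOT assume HDFS placed the buckets on the same node.
SET hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(small_t) */ b.k, b.v, s.w
FROM big_t b
JOIN small_t s ON b.k = s.k;
```

The design choice here is that bucketing reduces how much of the small table each mapper must pull (one bucket instead of the whole table), which is a weaker guarantee than Netezza's co-located, fully node-local join.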