Two main questions that bother me:
- How can I be sure that each of the 32 bucket files that Hive uses to store my tables ends up on its own unique machine?
- If that is the case, how can I be sure that when Hive launches 32 mappers, each of them will work on its local data? Does Hadoop/HDFS provide this magic, or does Hive, as a smart client, guarantee that this happens?
Background: I have a Hive cluster of 32 machines, and:
- All my tables are created with "CLUSTERED BY(MY_KEY) INTO 32 BUCKETS"
- I use hive.enforce.bucketing = true;
- I checked, and each table is indeed stored as 32 files under /user/hive/warehouse
- I use an HDFS replication factor of 2
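For anyone wanting to inspect this directly: one way to see which DataNodes actually hold each bucket file is HDFS's fsck tool with location reporting. This is a sketch assuming a table named my_table in the default warehouse path; adjust the path to your setup.

```shell
# List every file of the table, its blocks, and the DataNodes holding
# each replica. With replication factor 2, each block should show two
# host locations. (Path and table name are assumptions - adjust them.)
hdfs fsck /user/hive/warehouse/my_table -files -blocks -locations
```

If several bucket files report the same DataNode in their block locations, the buckets are not spread one-per-machine, since HDFS places blocks by its own placement policy and does not try to give each file a distinct node.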
Thanks!