Reducers stop at 66.68% when running a Hive join query

I am trying to join 6 tables, each of which contains about 5 million rows. The join is on an account number, which is sorted in ascending order in all tables. The map tasks complete successfully, but the reducers stop making progress at 66.68%. I tried increasing the number of reducers, and I also tried other settings: set hive.auto.convert.join = true; and set hive.hashtable.max.memory.usage = 0.9; and set hive.smalltable.filesize = 25000000L; but the result is the same. With a small number of records (for example, 5000 rows) the query works fine.

Please suggest what can be done here to make it work.

+4
2 answers

At 66%, reducers begin the actual reduce step (0-33% is shuffle, 33-66% is sort). In a Hive join, the reducer performs a Cartesian product between the data sets for each join key.

My guess is that there is at least one foreign key value that appears very often in all of the data sets. Watch out for NULL and default values.
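One way to check that guess is to count how often each join key value occurs in every table and look at the heaviest ones. The following is only a minimal Python sketch on in-memory rows, not something from the original thread: the column name account_number, the helper heaviest_keys, and the sample data are all hypothetical. On the real tables the same check would be a GROUP BY count on the join key.

```python
from collections import Counter

def heaviest_keys(rows, key="account_number", top=3):
    """Return the `top` most frequent join-key values, None (NULL) included."""
    counts = Counter(row.get(key) for row in rows)
    return counts.most_common(top)

# Hypothetical sample: None stands in for NULL, 0 for a default value.
sample = (
    [{"account_number": None}] * 4
    + [{"account_number": 0}] * 3
    + [{"account_number": 1001}, {"account_number": 1002}]
)

# The NULL and default keys dominate this made-up table.
print(heaviest_keys(sample))
```

If the same value tops the list in several tables, that key is a likely cause of the stalled reducer.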

For example, imagine that the key "abc" appears ten times in each of the six tables. The join then produces 10 ^ 6 rows, which is a million output records for this one key. If "abc" appears 1000 times in one table, 1000 in another, 1000 in a third, and then twice in each of the other three tables, you get 8 billion records (1000 ^ 3 * 2 ^ 3). You can see how this gets out of hand. I suspect that there is at least one key that produces a very large number of output records.
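The arithmetic above can be reproduced with a few lines of Python. This is just an illustrative sketch: the function name join_fanout is made up, and it assumes an inner join in which every table is joined on the same key.

```python
from math import prod

def join_fanout(counts_per_table):
    """Rows an inner join emits for one key: the product of its per-table counts."""
    return prod(counts_per_table)

# "abc" appears ten times in each of the six tables.
print(join_fanout([10] * 6))                     # 1000000

# 1000 times in three tables, twice in the other three.
print(join_fanout([1000, 1000, 1000, 2, 2, 2]))  # 8000000000
```

Feeding in the per-table counts of the heaviest real keys gives a rough idea of how many records a single reducer would have to emit.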

This is something to avoid in any RDBMS, not just in Hive. Doing multiple inner joins across many-to-many relationships can get you into a lot of trouble.

+10

To debug this, now and in the future, you can use the JobTracker to find and examine the logs for the reducer in question. You can then instrument the reduce operation to get a better picture of what is happening. Be careful not to overwhelm the job with the extra logging, of course. For example, try looking at the number of records that are input to the reduce operation.

0
