Dynamically load partitions in a Hive predicate

I have a very large table in Hive from which we need to load a subset of partitions. It looks something like this:

CREATE EXTERNAL TABLE table1 ( col1 STRING ) PARTITIONED BY (p_key STRING); 

I can load specific partitions as follows:

 SELECT * FROM table1 WHERE p_key = 'x'; 

where p_key is the key on which table1 is partitioned. If I hardcode it directly in the WHERE clause, everything works fine. However, I have another query that calculates which partitions I need. It is more complicated than this, but let it be defined simply as:

 SELECT DISTINCT p_key FROM table2; 

So now I would have to write a query like this:

 SELECT * FROM table1 WHERE p_key IN (SELECT DISTINCT p_key FROM table2); 

Or written as an inner join:

 SELECT t1.* FROM table1 t1 JOIN (SELECT DISTINCT p_key FROM table2) t2 ON t1.p_key = t2.p_key 

However, when I run this, it takes long enough to make me believe that it performs a full table scan. In the EXPLAIN output for the above queries, I also see that the result of the DISTINCT operation is used in the reducer and not in the mapper, which means that the mappers cannot know which partitions should be loaded. Admittedly, I'm not completely familiar with the internals of Hive, so I may be overlooking something.

I found this page on the wiki: MapJoin and Partition Pruning, which indicates that the feature was released in version 0.11.0. Therefore, I should have it.

Can this be done? If so, how?

1 answer

I'm not sure how to help with MapJoin, but in the worst case, you can dynamically create a second query with something like:

 SELECT concat("SELECT * FROM table1 WHERE p_key IN ('", concat_ws("','", collect_set(p_key)), "')") FROM table2; 

then execute the resulting query. Because the partition keys are now literals in the WHERE clause, the query planner should be able to prune the unwanted partitions.
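To check that pruning actually happens, you could inspect the generated statement before running it. The following is only a sketch: the keys 'a' and 'b' are hypothetical stand-ins for whatever table2 contains, and it assumes the string keys end up quoted in the IN list and that your Hive version supports EXPLAIN DEPENDENCY.

```sql
-- Hypothetical output of the generator query, assuming table2 holds the keys 'a' and 'b':
--   SELECT * FROM table1 WHERE p_key IN ('a','b')

-- Ask Hive which tables and partitions the generated statement would read,
-- without actually running it.
EXPLAIN DEPENDENCY
SELECT * FROM table1 WHERE p_key IN ('a', 'b');
```

If pruning works, the list of input partitions in the output should contain only the partitions for the listed keys rather than every partition of table1.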
