I have a very large table in Hive from which we need to load a subset of partitions. It looks something like this:
CREATE EXTERNAL TABLE table1 ( col1 STRING ) PARTITIONED BY (p_key STRING);
I can load specific sections as follows:
SELECT * FROM table1 WHERE p_key = 'x';
with p_key is the key on which table1 is split. If I hardcode it directly in the WHERE clause, all of this is good. However, I have another query that calculates which partitions I need. It is more complicated than that, but let it be defined simply as:
SELECT DISTINCT p_key FROM table2;
So, now I would have to create a dirty query like this:
SELECT * FROM table1 WHERE p_key IN (SELECT DISTINCT p_key FROM table2);
Or written as an inner join:
SELECT t1.* FROM table1 t1 JOIN (SELECT DISTINCT p_key FROM table2) t2 ON t1.p_key = t2.p_key
However, when I run this, it takes enough time to let me believe that it performs a full table scan. In the explanation of the above queries, I also see that the result of the DISTINCT operation is used in the reducer and not in the cartographer, which means that the cartographer will not know which sections should be loaded or not. Of course, I'm not completely familiar with the release of Hive, so I can ignore something.
I found this page: MapJoin and Partition Pruning on the wiki and aggressive indicates that it was released in version 0.11.0. Therefore, I must have it.
Can this be done? If so, how?