Is there a performance difference for these two Hive queries joining two tables and filtering by partition key?

Question

Is there a performance difference for these two Hive queries joining two tables and filtering by partition key?

Suppose table A and B have table ds .

Method 1

 SELECT * FROM A JOIN B ON A.userid=B.userid WHERE A.ds='2014-01-01' AND B.ds='2014-01-01'

Method 2

 SELECT * FROM ( SELECT * FROM A WHERE A.ds='2014-01-01' ) JOIN ( SELECT * FROM B WHERE B.ds='2014-01-01' ) ON A.userid=B.userid

Will the second query be faster?

I am wondering how WHERE and JOIN work in Hive. Is a WHERE applied to the source table before joining, when possible (while the clause contains only one table alias, such as the one above) or always applies only after joining tables (for example, A.userid > B.userid should be applied after joining) ?

+7

hql hive

Xiangpeng zhao Jan 16 '14 at 0:00

source share

1 answer

dimamah · Accepted Answer · 2014-01-17T13:17:37+0000

Your question is actually about the predicate in the hive.
Well, in the case above, the execution will be exactly the same as the hive will have the predicate A.ds='2014-01-01' AND B.ds='2014-01-01' to mappers before joining.

In a more general case, JOIN (inner join) is actually quite easy and can be summed up to:
If he can push, he will push.
It can click a predicate when only one table is involved ( where ax > 1 ), and cannot click if more than one table is involved ( A.userid > B.userid ), since the cartographer reads only the partition of one of the tables ..

The more complex part of OUTER JOIN and furtunelty is very clearly explained here .

PS
The pushdown predicate is controlled by hive.optimize.ppd , which is true by default.

Is there a performance difference for these two Hive queries joining two tables and filtering by partition key?

More articles: