A left outer join in BigQuery on multiple keys is not if one of them is zero

I came across something strange with left outer joins on multiple keys in BigQuery. If one of the keys is null in the right table, it does not match, and therefore it puts null in all the values ​​for the right table.

If I have 2 tables

  Table1 Table2 ------------ ---------------- k1 |k2 |v3 k1 |k2 | v4 ------------ ---------------- foo|boo |hey foo|NULL| you bar|test|yo bar|test| hi foo|NULL|hey 

and I do the following connection

 SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1 LEFT OUTER JOIN EACH [Table2] t2 ON t1.k1=t2.k1 AND t1.k2=t2.k2 

I get this result

 t1_k1|t1_k2|t1_v3|t2_v4 ----------------------- foo |boo |hey |NULL --No match here so NULL in t2_v4 it ok bar |test |hey |hi --It matches here on bar and test foo |NULL |hey |NULL --It doesn't match on foo NULL. 

I would expect the last line to be

 foo |NULL |hey |you 

Is this expected behavior? With this, I do not mean null matches.

Is there any other way to create the result that I want?

+6
source share
2 answers

In SQL, NULL not equal to anyone - not even yourself.

Gordon's answer contains some reasonable suggestions, but note that BigQuery only supports join conditions, which are equal unions, which excludes the use of OR or IS NULL .

Can you use a non-zero checkpoint value? For example, if you replace NULL an empty string (or the string "null" or something else that does not happen elsewhere in your data), the connection will work as you would expect. You can even do this on the fly using a subquery with minimal performance overhead.

 SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM (SELECT IFNULL(k1, "null") k1, IFNULL(k2, "null") k2 FROM [Table1]) t1 LEFT OUTER JOIN EACH (SELECT IFNULL(v3, "null") v3, IFNULL(v4, "null") v4 FROM [Table2]) t2 ON t1.k1 = t2.k1 AND t1.k2 = t2.k2 

You can add an external selection to turn the string "null" back to real NULL .

Obviously, this only works if the string "null" does not occur elsewhere in your data.

+6
source

This is standard behavior. One way to fix this is to make the join clause more complex:

 SELECT t1.k1, t1.k2, t1.v3, t2.v4 FROM [Table1] t1 LEFT OUTER JOIN EACH [Table2] t2 ON (t1.k1 = t2.k1 OR (t1.k1 is null and t2.k1 is null)) AND (t1.k2 = t2.k2 OR (t1.k2 is null and t2.k2 is null)) 

This can adversely affect performance. Some databases have a null safe equality operation that can be used, but I don't think BigQuery supports this.

So, the best advice is to fix the data, so NULL not a valid key value. Otherwise, you can use multiple connections. If you had only one join key, it would look like

 SELECT t1.k1, t1.k2, t1.v3, coalesce(t2.v4, t2null.v4) FROM [Table1] t1 LEFT OUTER JOIN EACH [Table2] t2 ON t1.k1 = t2.k1 LEFT OUTER JOIN EACH [Table2] t2null ON (t1.k1 is null and t2.k1 is null) ; 

However, it is harder to assemble for composite join keys.

+1
source

All Articles