Comparing Two Tables for Equality in HIVE

I have two tables, table1 and table2. Each of them has the same columns:

key, c1, c2, c3 

I want to check if these tables are equal to each other (they have the same rows). So far, I have these two queries (<> = not equal in HIVE):

 select count(*) from table1 t1 left outer join table2 t2 on t1.key=t2.key where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3 

and

 select count(*) from table1 t1 left outer join table2 t2 on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3 where t2.key is null 

So my idea is that if a count of zero is returned, the tables are the same. However, I get zero for the first query and a non-zero count for the second query. How do they differ from each other? If there is a better way to test this, let me know.

8 answers

The first query excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2 or t2.c3 is NULL, because a <> comparison with NULL never evaluates to true. This means that you are effectively performing an inner join.

The second will find the rows that exist in t1, but not in t2.

To also find rows that exist in t2 but not in t1, you can do a full outer join. The following SQL assumes all columns are NOT NULL:

 select count(*)
 from table1 t1
 full outer join table2 t2
   on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
 where t1.key is null /* this condition matches rows that only exist in t2 */
    or t2.key is null /* this condition matches rows that only exist in t1 */
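The difference between the two queries in the question can be reproduced on a tiny data set. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption; both follow the standard SQL rule that comparing anything with NULL is never true), and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    -- Made-up rows: key 2 differs only in c2, and table1 holds NULL there.
    INSERT INTO table1 VALUES (1, 'a', 'b', 'c'), (2, 'a', NULL, 'c');
    INSERT INTO table2 VALUES (1, 'a', 'b', 'c'), (2, 'a', 'x', 'c');
""")

# First query from the question: join on key only, compare columns in WHERE.
first_query = """
    SELECT COUNT(*) FROM table1 t1 LEFT OUTER JOIN table2 t2 ON t1.key = t2.key
    WHERE t2.key IS NULL OR t1.c1 <> t2.c1 OR t1.c2 <> t2.c2 OR t1.c3 <> t2.c3
"""

# Second query from the question: all columns are part of the join condition.
second_query = """
    SELECT COUNT(*) FROM table1 t1 LEFT OUTER JOIN table2 t2
      ON t1.key = t2.key AND t1.c1 = t2.c1 AND t1.c2 = t2.c2 AND t1.c3 = t2.c3
    WHERE t2.key IS NULL
"""

first_count = conn.execute(first_query).fetchone()[0]
second_count = conn.execute(second_query).fetchone()[0]
print(first_count, second_count)
```

For key 2 the first query evaluates NULL <> 'x', which is NULL, so the WHERE clause drops the row. In the second query the same NULL makes the join condition fail, so the row is left unmatched and gets counted.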

If you want to check for duplicates, and the tables have exactly the same structure, and the tables do not have duplicates inside them, you can do:

 select t.key, t.c1, t.c2, t.c3, count(*) as cnt
 from ((select t1.*, 1 as which from table1 t1)
       union all
       (select t2.*, 2 as which from table2 t2)
      ) t
 group by t.key, t.c1, t.c2, t.c3
 having cnt <> 2;

There are various ways in which you can, if necessary, relax the conditions in the first paragraph.

Note that this version also works when the columns contain NULL values. Those may be what is causing the problem with your data.
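To see the NULL handling of the UNION ALL / GROUP BY approach, here is a small sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Hive and uses made-up rows; the column list is written out and the parentheses around the SELECTs are dropped because SQLite does not accept them.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    -- Made-up rows: key 1 is identical (including a NULL), key 2 differs in c2.
    INSERT INTO table1 VALUES (1, 'a', NULL, 'c'), (2, 'a', 'b', 'c');
    INSERT INTO table2 VALUES (1, 'a', NULL, 'c'), (2, 'a', 'x', 'c');
""")

# Stack both tables, then keep every distinct row that does not appear exactly twice.
mismatch_query = """
    SELECT t.key, t.c1, t.c2, t.c3, COUNT(*) AS cnt
    FROM (SELECT key, c1, c2, c3 FROM table1
          UNION ALL
          SELECT key, c1, c2, c3 FROM table2) t
    GROUP BY t.key, t.c1, t.c2, t.c3
    HAVING COUNT(*) <> 2
    ORDER BY t.key, t.c2
"""

mismatches = conn.execute(mismatch_query).fetchall()
for row in mismatches:
    print(row)
```

GROUP BY puts NULLs into the same group, so the row with key 1 is counted twice and drops out, while both versions of key 2 are reported with a count of 1.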


I would recommend you not use JOIN to compare tables:

  • They are quite expensive operations when tables are large (as is often the case in Hive)
  • They can cause problems when some rows / join keys are repeated

(and it can also be impractical when the data is in different clusters / data centers / clouds).

Instead, I think the better approach is to compute a checksum for each table and compare the two checksums.

I developed a Python script that makes it easy to make such a comparison and see the differences in a web browser:

https://github.com/bolcom/hive_compared_bq

I hope this can help you!
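To give a rough idea of what a checksum comparison can look like, here is a minimal sketch. It is not the method used by the linked tool; the function names `row_digest` and `table_checksum` and the choice of SHA-256 are assumptions made for illustration. Each row is hashed, and the hashes are summed so that the result does not depend on row order.

```python
import hashlib

def row_digest(row):
    """Hash one row to a 64-bit integer (hypothetical helper)."""
    # repr() keeps None (NULL) distinct from the string 'None' and from ''.
    encoded = repr(tuple(row)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(encoded).digest()[:8], "big")

def table_checksum(rows):
    """Return (row count, order-independent checksum) for an iterable of rows."""
    count = 0
    total = 0
    for row in rows:
        count += 1
        # Summing (instead of XOR) keeps duplicate rows from cancelling out.
        total = (total + row_digest(row)) % 2**64
    return count, total

# Made-up sample data: same rows in a different order, then one changed value.
rows_a = [(1, "a", "b", "c"), (2, "a", None, "c")]
rows_b = [(2, "a", None, "c"), (1, "a", "b", "c")]
rows_c = [(1, "a", "b", "c"), (2, "a", "x", "c")]

print(table_checksum(rows_a) == table_checksum(rows_b))
print(table_checksum(rows_a) == table_checksum(rows_c))
```

Only the count and the checksum have to travel between systems, which is what makes this kind of comparison practical when the two tables live in different clusters. Equal checksums make equality very likely but do not strictly prove it.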


Another variant:

 select a.n1 - b.n2 as different_row_counts, a.n1 - c.n3 as mismatched_rows
 from (select count(*) as n1 from table1) a
 cross join (select count(*) as n2 from table2) b
 cross join (select count(*) as n3 from table1 t1 join table2 t2 on t1.key = t2.key and t1.c1 = t2.c1) c

Try with the WITH clause:

 With cnt as( select count(*) cn1 from table1 ) select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2); 

One simple solution is an inner join. Suppose we have two Hive tables, namely table1 and table2. Both tables have the same columns: col1, col2 and col3. The number of rows should also be the same. Then the query will be as follows:


 select count(*) from table1 inner join table2 on table1.col1 = table2.col1 and table1.col2 = table2.col2 and table1.col3 = table2.col3 ; 


If the resulting count matches the number of rows in table1 and table2, then all columns have the same values. If, however, the count is smaller, some of the data differs.


1: First get the row counts of the two tables, C1 and C2. C1 and C2 must be equal. C1 and C2 can be obtained with the following query (run once per table):

 select count(*) from table1 

If C1 and C2 are not equal, the tables are not identical.

2: Find the distinct row counts of the two tables, DC1 and DC2. DC1 and DC2 must be equal. The number of distinct rows can be found using the following query:

 select count(*) from (select distinct * from table1) d 

If DC1 and DC2 are not equal, the tables are not identical.

3: Now get the number of records in the union of the two tables. Let it be U. Use the following query to get the number of records in the union of the 2 tables:

 SELECT count(*) FROM (SELECT * FROM table1 UNION SELECT * FROM table2) u 

We can say that the data in the 2 tables are identical if the distinct count of each table equals the number of records in the union of the two tables, i.e. DC1 = U and DC2 = U.
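The three steps can be put together roughly as follows. This is a sketch under assumptions: SQLite (through Python's sqlite3 module) stands in for Hive, and `make_db` and `tables_look_identical` are hypothetical helper names that are not part of the answer above.

```python
import sqlite3

def make_db(rows1, rows2):
    """Build an in-memory database with table1 and table2 (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("table1", rows1), ("table2", rows2)):
        conn.execute(f"CREATE TABLE {name} (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?, ?)", rows)
    return conn

def tables_look_identical(conn):
    """Apply the three count checks: C1 = C2, DC1 = DC2, DC1 = U."""
    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    c1 = scalar("SELECT COUNT(*) FROM table1")
    c2 = scalar("SELECT COUNT(*) FROM table2")
    dc1 = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table1) d")
    dc2 = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table2) d")
    u = scalar("SELECT COUNT(*) FROM (SELECT * FROM table1 UNION SELECT * FROM table2) u")
    return c1 == c2 and dc1 == dc2 and dc1 == u

# Made-up sample rows.
base = [(1, "a", "b", "c"), (2, "d", None, "f")]
changed = [(1, "a", "b", "c"), (2, "d", "x", "f")]
print(tables_look_identical(make_db(base, base)))
print(tables_look_identical(make_db(base, changed)))

# Caveat: duplicates with different multiplicities still pass all three checks.
dup1 = [(1, "a", "b", "c"), (1, "a", "b", "c"), (2, "d", "e", "f")]
dup2 = [(1, "a", "b", "c"), (2, "d", "e", "f"), (2, "d", "e", "f")]
print(tables_look_identical(make_db(dup1, dup2)))
```

The last call shows a limit of the count-based test: both tables have 3 rows, 2 distinct rows and a union of 2 rows, so they pass even though the duplicated row differs.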


Use the MINUS statement:

 SELECT count(*) FROM (SELECT t1.c1, t1.c2, t1.c3 from table1 t1 MINUS SELECT t2.c1, t2.c2, t2.c3 from table2 t2) d 
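Keep in mind that a set difference only looks in one direction. The sketch below runs it both ways on made-up rows; it assumes SQLite (through Python's sqlite3 module) as a stand-in for Hive, and SQLite spells the operator EXCEPT rather than MINUS.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    -- Made-up rows: table2 contains one extra row.
    INSERT INTO table1 VALUES (1, 'a', 'b', 'c');
    INSERT INTO table2 VALUES (1, 'a', 'b', 'c'), (2, 'd', 'e', 'f');
""")

def count_difference(left, right):
    """Count distinct (c1, c2, c3) rows present in `left` but missing from `right`."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT c1, c2, c3 FROM {left}
            EXCEPT
            SELECT c1, c2, c3 FROM {right}
        ) d
    """
    return conn.execute(sql).fetchone()[0]

only_in_table1 = count_difference("table1", "table2")
only_in_table2 = count_difference("table2", "table1")
print(only_in_table1, only_in_table2)
```

Here table1 minus table2 is empty although the tables differ, so both directions have to return zero before the tables can be considered equal (and even then duplicate rows are not compared).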
