Find the difference between two large tables in PostgreSQL

I have two similar tables in Postgres, each with a single 32-byte Latin field (a plain md5 hash). Both tables have about 30,000,000 rows and differ only slightly (10 to 1000 rows are different).

Can Postgres find the difference between these tables? The result should be the 10 to 1000 rows described above.

This is not a real task; I just want to know how PostgreSQL handles the JOIN logic.

+8
sql exists left-join full-outer-join postgresql
2 answers

The best option is probably a NOT EXISTS anti-semi-join.

In this example, tbl1 is the table with surplus rows:

 SELECT * FROM tbl1 WHERE NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col); 
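Since the question is really about how PostgreSQL handles the join logic, the plan for this query can be inspected with EXPLAIN. This is a minimal sketch that reuses the tbl1, tbl2 and col names from the example above. The plan actually chosen depends on table statistics, indexes and memory settings, so no output is shown here.

```sql
-- Show the plan the planner would pick, without executing the query.
EXPLAIN
SELECT *
FROM   tbl1
WHERE  NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col);

-- EXPLAIN (ANALYZE, BUFFERS) also runs the query and reports
-- actual row counts, timings and buffer usage per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   tbl1
WHERE  NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col);
```

On tables of this size the ANALYZE variant does the full work of the query, so it may take a while.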

If you don't know which table has surplus rows, or if both do, you can either repeat the query above with the table names switched, or:

 SELECT * FROM tbl1 FULL OUTER JOIN tbl2 USING (col) WHERE tbl2.col IS NULL OR tbl1.col IS NULL; 
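To make the FULL OUTER JOIN variant easier to follow, here is a small self-contained sketch that also reports which side each unmatched value came from. The temporary tables t1_demo and t2_demo and the side column are hypothetical names introduced only for illustration, and col is assumed to be NOT NULL in both tables.

```sql
-- Hypothetical demo tables standing in for the two large tables.
CREATE TEMP TABLE t1_demo (col text NOT NULL);
CREATE TEMP TABLE t2_demo (col text NOT NULL);

INSERT INTO t1_demo VALUES ('a'), ('b'), ('c');
INSERT INTO t2_demo VALUES ('b'), ('c'), ('d');

-- With USING (col), the unqualified col is the merged join column,
-- while t1_demo.col and t2_demo.col still refer to each side.
SELECT col,
       CASE WHEN t2_demo.col IS NULL THEN 'only in t1_demo'
            ELSE 'only in t2_demo'
       END AS side
FROM   t1_demo
FULL   OUTER JOIN t2_demo USING (col)
WHERE  t1_demo.col IS NULL OR t2_demo.col IS NULL
ORDER  BY col;
-- Returns two rows: ('a', 'only in t1_demo') and ('d', 'only in t2_demo').
```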

An overview of the main methods in a later post:

  • Select rows not in another table.

By the way, it would be much more efficient to store md5 hashes in uuid columns, which take 16 bytes per value instead of 32 characters of text.
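As a rough sketch of that suggestion: an md5 hash is 32 hexadecimal digits, which PostgreSQL accepts as input for the uuid type. The example below assumes the tbl1 and col names from the answer and that every existing value is a valid 32-digit hex string. The ALTER TABLE statement rewrites the whole table, so it is not something to run casually on 30,000,000 rows.

```sql
-- An md5 hash cast directly to uuid.
SELECT md5('some input')::uuid AS hash_as_uuid;

-- Convert an existing text column in place (assumes all values are
-- valid 32-digit hex strings; otherwise the cast raises an error).
ALTER TABLE tbl1
    ALTER COLUMN col TYPE uuid USING col::uuid;
```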

+18

In my experience, NOT IN with a subquery takes a very long time. I would do this with a join instead:

 DELETE FROM table1 WHERE id IN (SELECT table1.id FROM table1 LEFT OUTER JOIN table2 ON table1.hashfield = table2.hashfield WHERE table2.hashfield IS NULL); 

And then do the same for the other table.
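Because the question only asks to find the differing rows, the same LEFT JOIN / IS NULL pattern can first be run as a plain SELECT, which leaves the data untouched. This is a sketch using the table1, table2 and hashfield names from this answer, and it assumes hashfield is never NULL in table2.

```sql
-- Rows of table1 that have no matching hashfield in table2.
SELECT table1.*
FROM   table1
LEFT   JOIN table2 ON table1.hashfield = table2.hashfield
WHERE  table2.hashfield IS NULL;
```

Swapping the two table names gives the rows that exist only in table2.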

-1