Find the difference between two large tables in PostgreSQL

Question

I have two similar tables in Postgres, each with a single 32-character Latin text field (a plain md5 hash). Both tables have about 30,000,000 rows and differ only slightly (10 to 1,000 rows are different).

Can Postgres find the difference between these tables? The result should be the 10 to 1,000 rows described above.

This is not a real task, I just want to know how PostgreSQL deals with JOIN logic.


sql exists left-join full-outer-join postgresql

odiszapc Mar 11 '13 at 2:46


2 answers

Erwin Brandstetter · Answer 1 · 2013-03-11T07:12:04+0000

The best option is probably an EXISTS anti-semi-join.

tbl1 is the table with surplus rows in this example:

 SELECT * FROM tbl1 WHERE NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col);

If you don’t know which table has surplus rows, or both do, you can either repeat the above query with the table names swapped, or:

 SELECT * FROM tbl1 FULL OUTER JOIN tbl2 USING (col) WHERE tbl2.col IS NULL OR tbl1.col IS NULL;
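
To try both queries without 30,000,000 rows, here is a minimal sketch on throwaway tables. The temp tables, the sample data generated with generate_series, and the "side" label are assumptions made for illustration; only tbl1, tbl2 and col come from the answer. The final EXPLAIN shows how the planner executes the NOT EXISTS form, which is typically some kind of anti join, though the exact plan depends on table statistics and settings.

```sql
-- Hypothetical demo data: tbl1 holds hashes of 1..1000, tbl2 holds hashes of 6..1005.
CREATE TEMP TABLE tbl1 (col text PRIMARY KEY);
CREATE TEMP TABLE tbl2 (col text PRIMARY KEY);

INSERT INTO tbl1 SELECT md5(g::text) FROM generate_series(1, 1000) AS g;
INSERT INTO tbl2 SELECT md5(g::text) FROM generate_series(6, 1005) AS g;

-- Refresh planner statistics for the new tables.
ANALYZE tbl1;
ANALYZE tbl2;

-- Rows that exist only in tbl1 (the hashes of 1..5).
SELECT col
FROM   tbl1
WHERE  NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col);

-- Rows missing from either table, labelled by the side they come from.
SELECT col,
       CASE WHEN tbl2.col IS NULL THEN 'only in tbl1'
            ELSE 'only in tbl2' END AS side
FROM   tbl1
FULL   OUTER JOIN tbl2 USING (col)
WHERE  tbl1.col IS NULL OR tbl2.col IS NULL;

-- Inspect the join strategy the planner chooses for the anti-join.
EXPLAIN
SELECT col
FROM   tbl1
WHERE  NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col);
```

On the real tables, EXPLAIN (ANALYZE, BUFFERS) on the same query would additionally report actual run time and I/O.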

An overview of the main methods in a later post:

Select rows not in another table.

By the way, it would be much more efficient to use uuid columns for md5 hashes:

Convert hex in text representation to decimal number
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
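
As a rough sketch of the uuid suggestion: PostgreSQL accepts a 32-digit hex string without hyphens as uuid input, so an md5 hash stored as text can be cast directly. This assumes every row of col really holds exactly 32 hex characters; the cast fails otherwise. Note that changing the column type rewrites the table and holds an exclusive lock while it runs.

```sql
-- Assumption: col contains a valid 32-character md5 hex string in every row.
ALTER TABLE tbl1 ALTER COLUMN col TYPE uuid USING col::uuid;
ALTER TABLE tbl2 ALTER COLUMN col TYPE uuid USING col::uuid;

-- New hashes can be cast the same way before inserting or comparing.
SELECT md5('some input')::uuid;
```

A uuid value occupies 16 bytes, compared with 32 characters plus overhead for the text form, which makes both the table and its index smaller.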

0xCAFEBABE · Answer 2 · 2013-03-11T07:45:08+0000

In my experience, NOT IN with a subquery takes a very long time. I would do this with a join instead:

 DELETE FROM table1 WHERE id IN (SELECT table1.id FROM table1 LEFT OUTER JOIN table2 ON table1.hashfield = table2.hashfield WHERE table2.hashfield IS NULL);

And then do the same for the other table.
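
Since the question asks only to find the differing rows, a read-only variant of the same join may be closer to what is needed. This is a sketch that reuses the table and column names from the statement above and deletes nothing:

```sql
-- Rows of table1 with no matching hashfield in table2.
SELECT table1.*
FROM   table1
LEFT   OUTER JOIN table2 ON table1.hashfield = table2.hashfield
WHERE  table2.hashfield IS NULL;
```

Swapping table1 and table2 gives the rows that exist only in table2.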
