Here is a test to demonstrate the performance of the three valid answers.
EXISTS clearly outperforms LEFT JOIN / GROUP BY:
Test setup
A table with 100,000 rows and about 1,000 distinct values for b.
The performance difference grows with more rows; fewer duplicates mean a smaller difference.
No indexes.
CREATE TABLE tbl (a text, b text);
INSERT INTO tbl SELECT (random()*10000)::int::text, (random()*1000)::int || ' some more text here' FROM generate_series(1, 100000) g;
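As a quick sanity check of the setup (not part of the original test), a query along these lines reports the row count and the number of distinct values for b. Because (random()*1000)::int rounds to the nearest integer, up to 1001 distinct values are possible, and the exact figures vary from run to run. The column aliases are made up for illustration.

```sql
-- Sanity check for the test table: total rows, distinct b values,
-- and the average number of rows sharing one b value.
SELECT count(*)                                        AS row_count,
       count(DISTINCT b)                               AS distinct_b,
       round(count(*)::numeric / count(DISTINCT b), 1) AS avg_rows_per_b
FROM   tbl;
```

With roughly 100 rows per b value, every row has about 100 join partners, which is what makes the JOIN variants expensive.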
1. @Guffa: LEFT JOIN / GROUP BY / HAVING
EXPLAIN ANALYZE SELECT t.a, t.b FROM tbl t LEFT JOIN tbl t2 ON t2.b = t.b AND t2.a <> t.a GROUP BY t.a, t.b HAVING count(t2.a) >= 1;
2. The same thing, simplified to just JOIN / GROUP BY
EXPLAIN ANALYZE SELECT t.a, t.b FROM tbl t JOIN tbl t2 ON t2.b = t.b AND t2.a <> t.a GROUP BY t.a, t.b;
3. EXISTS
EXPLAIN ANALYZE SELECT * FROM tbl t WHERE EXISTS ( SELECT * FROM tbl t2 WHERE t2.a <> t.a AND t2.b = t.b );
4. JOIN / DISTINCT
EXPLAIN ANALYZE SELECT DISTINCT t.a, t.b FROM tbl t JOIN tbl t2 ON t2.b = t.b AND t2.a <> t.a;
SQLfiddle displaying EXPLAIN ANALYZE output for the queries.
Results
1. Total runtime: 12208.954 ms
2. Total runtime: 11504.460 ms
3. Total runtime: 272.508 ms (about 45 times faster than 1.)
4. Total runtime: 11540.627 ms
After adding a multicolumn index (SQLfiddle):
CREATE INDEX a_b_idx ON tbl(b, a);
... the runtimes do not change. Postgres does not use the index. Evidently, the planner expects sequential scans to be faster, since the whole table has to be read anyway.
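One way to probe the planner's choice, sketched here as a diagnostic and not something from the original test, is to discourage sequential scans for a single transaction and compare the resulting plan and runtime. enable_seqscan is a standard PostgreSQL planner setting; turning it off does not forbid sequential scans, it only makes them look very expensive to the planner. This is meant for experiments, not for production use.

```sql
BEGIN;
-- Affects only the current transaction; reverts at ROLLBACK / COMMIT.
SET LOCAL enable_seqscan = off;

EXPLAIN ANALYZE
SELECT *
FROM   tbl t
WHERE  EXISTS (
   SELECT *
   FROM   tbl t2
   WHERE  t2.a <> t.a
   AND    t2.b = t.b
   );

ROLLBACK;
```

If the plan produced this way is slower than the original one, the planner's preference for sequential scans was justified for this data.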
Besides the runtime, also note the row counts, which prove my point as discussed:
the JOIN creates many intermediate duplicates that the EXISTS version avoids.
EXPLAIN ANALYZE output for 1.:
HashAggregate  (cost=230601.26..230726.26 rows=10000 width=31) (actual time=12127.090..12183.087 rows=99476 loops=1)
  Filter: (count(t2.a) >= 1)
  ->  Hash Left Join  (cost=3670.00..154661.89 rows=10125250 width=31) (actual time=99.591..5897.744 rows=9991102 loops=1)
        Hash Cond: (t.b = t2.b)
        Join Filter: (t2.a <> t.a)
        Rows Removed by Join Filter: 101052
        ->  Seq Scan on tbl t  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.036..36.197 rows=100000 loops=1)
        ->  Hash  (cost=1736.00..1736.00 rows=100000 width=27) (actual time=99.141..99.141 rows=100000 loops=1)
              Buckets: 2048  Batches: 8  Memory Usage: 784kB
              ->  Seq Scan on tbl t2  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.004..44.899 rows=100000 loops=1)
Total runtime: 12208.954 ms
EXPLAIN ANALYZE output for 3.:
Hash Semi Join  (cost=3670.00..7783.00 rows=1 width=27) (actual time=81.630..247.371 rows=100000 loops=1)
  Hash Cond: (t.b = t2.b)
  Join Filter: (t2.a <> t.a)
  Rows Removed by Join Filter: 1009
  ->  Seq Scan on tbl t  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.010..32.758 rows=100000 loops=1)
  ->  Hash  (cost=1736.00..1736.00 rows=100000 width=27) (actual time=81.388..81.388 rows=100000 loops=1)
        Buckets: 2048  Batches: 8  Memory Usage: 784kB
        ->  Seq Scan on tbl t2  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.003..32.114 rows=100000 loops=1)
Total runtime: 272.508 ms
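The row counts in the two plans can also be checked directly. As a rough sketch (these two queries are an illustration, not part of the original test): with about 100 rows per b value, each of the 100,000 rows finds roughly 100 partners, so the join has to produce on the order of 10 million rows before they are collapsed again. That matches the rows=9991102 reported for the Hash Left Join above. The semi-join stops at the first match and emits each outer row at most once. Since the data is random, exact counts differ between runs.

```sql
-- Rows the plain join produces before GROUP BY / DISTINCT collapses them.
SELECT count(*) AS join_rows
FROM   tbl t
JOIN   tbl t2 ON t2.b = t.b
             AND t2.a <> t.a;

-- Rows the EXISTS version produces: at most one per row of tbl.
SELECT count(*) AS exists_rows
FROM   tbl t
WHERE  EXISTS (
   SELECT 1
   FROM   tbl t2
   WHERE  t2.a <> t.a
   AND    t2.b = t.b
   );
```

The gap between the two counts is the amount of duplicate work the JOIN variants do and then throw away.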