I ran a couple of tests. The EXISTS variant is much faster - as I expected, and contrary to what @Tometzky posted.
Test bench with 10,000 rows on PostgreSQL 9.1.2 with decent settings:
CREATE TEMP TABLE test (
   a serial
  ,b int NOT NULL
  ,c int NOT NULL
);

INSERT INTO test (b, c)
SELECT (random() * 100)::int AS b
      ,(random() * 100)::int AS c
FROM   generate_series(1, 10000);

ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);
Between the first and second rounds of the tests, I ran:
ANALYZE test;
The final DELETE removed 3368 duplicates in each round. Performance may vary if you have significantly more or fewer duplicates.
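If you want to count the duplicates up front without deleting anything, something along these lines works (my addition, not part of the original test bench; it uses the same window logic as the row_number() queries below):

SELECT count(*) AS duplicates
FROM  (
   SELECT row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
   FROM   test
   ) sub
WHERE  rn > 1;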
I ran each query several times with EXPLAIN ANALYZE and took the best result. As a rule, the best run hardly differs from the first or the worst.
A bare SELECT (without the DELETE) shows similar results.
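Note that EXPLAIN ANALYZE actually executes the DELETE. To time repeated runs against identical data, you can wrap each run in a transaction and roll it back - a minimal sketch of that technique (my assumption about the procedure, not stated above):

BEGIN;
EXPLAIN ANALYZE
DELETE FROM test ...;  -- one of the four queries below
ROLLBACK;              -- undo the delete so the next run sees the same rows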
1. CTE with rank()
Total runtime: 150.411 ms
Total runtime: 149.853 ms - after ANALYZE
WITH x AS (
   SELECT a
         ,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
   FROM   test
   )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    x.rk > 1;
2. CTE with row_number()
Total runtime: 148.240 ms
Total runtime: 147.711 ms - after ANALYZE
WITH x AS (
   SELECT a
         ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
   FROM   test
   )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    x.rn > 1;
3. row_number() in a subquery
Total runtime: 134.753 ms
Total runtime: 134.298 ms - after ANALYZE
DELETE FROM test
USING (
   SELECT a
         ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
   FROM   test
   ) x
WHERE  x.a = test.a
AND    x.rn > 1;
4. EXISTS semi-join
Total runtime: 143.777 ms
Total runtime: 69.072 ms - after ANALYZE
DELETE FROM test t
WHERE  EXISTS (
   SELECT 1
   FROM   test t1
   WHERE  t1.a < t.a
   AND   (t1.b, t1.c) = (t.b, t.c)
   );
The big difference in the second run comes from the planner switching to a Hash Semi Join instead of an extra Sort plus a Merge Semi Join.
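You can inspect the chosen plan without executing the statement by using plain EXPLAIN (a quick sketch; the join node names above are what to look for in its output):

EXPLAIN
DELETE FROM test t
WHERE  EXISTS (
   SELECT 1
   FROM   test t1
   WHERE  t1.a < t.a
   AND   (t1.b, t1.c) = (t.b, t.c)
   );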
Results
- EXISTS clearly wins with up-to-date table statistics.
- With stale statistics, row_number() in a subquery is the fastest; rank() is the slowest option.
- The CTE is slower than the subquery.
- ANALYZE (up-to-date statistics) helps performance, and can help a lot. Autovacuum (by default) should more or less take care of this automatically - with the exception of temporary tables or immediately after major changes to the table. More details here or here.
Test with 100,000 rows
I repeated the test with 100,000 rows and 63045 duplicates. Similar results, except that EXISTS was slower, even after ANALYZE.
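For reference, a sketch of the 100k-row setup, assuming the same schema as above with only the bounds changed. The random() range used in this round isn't stated; 200 is purely my guess, chosen because it yields roughly the reported number of duplicates:

INSERT INTO test (b, c)
SELECT (random() * 200)::int AS b  -- range is an assumption, see above
      ,(random() * 200)::int AS c
FROM   generate_series(1, 100000);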
1. Total runtime: 1648.601 ms
2. Total runtime: 1623.759 ms
3. Total runtime: 1568.893 ms
4. Total runtime: 1692.249 ms
Raising the statistics target to 1000, and then to the maximum of 10000 (overkill in real life), plus another ANALYZE, sped all queries up by ~ 1%, but the query planner still went with Sort + Merge Semi Join for EXISTS.
ALTER TABLE test ALTER COLUMN b SET STATISTICS 10000;
ALTER TABLE test ALTER COLUMN c SET STATISTICS 10000;
ANALYZE test;
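If you want to verify that the per-column target took effect, it is stored in pg_attribute (a small sketch; -1 means the column falls back to default_statistics_target):

SELECT attname, attstattarget
FROM   pg_attribute
WHERE  attrelid = 'test'::regclass
AND    attname IN ('b', 'c');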
Only after forcing the planner to avoid merge joins did it use a Hash Semi Join, cutting the time in half:
SET enable_mergejoin = off;
- Total runtime: 850.615 ms
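Keep in mind that enable_mergejoin is a planner setting for the whole session, not just one statement, so for a one-off experiment you would typically restore it afterwards (sketch):

RESET enable_mergejoin;  -- back to the default (on)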
Update
The query planner has been improved since then. A re-test with PostgreSQL 9.1.7 went straight to a Hash Semi Join.