I ran a couple of tests. The EXISTS variant is much faster - as I expected, and contrary to what @Tometzky posted.
Test bench with 10,000 rows on PostgreSQL 9.1.2 with decent settings:
CREATE TEMP TABLE test (
   a serial
  ,b int NOT NULL
  ,c int NOT NULL
);

INSERT INTO test (b, c)
SELECT (random() * 100)::int AS b
      ,(random() * 100)::int AS c
FROM   generate_series(1, 10000);

ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);
Between the first and second rounds of the tests, I ran:
ANALYZE test;
The final DELETE removed 3368 duplicates in each round. Performance may vary if you have significantly more or fewer duplicates.
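If you want to count the duplicates up front without deleting anything, something along these lines works (my addition, not part of the original test bench; it uses the same window logic as the row_number() queries below):

SELECT count(*) AS duplicates
FROM  (
   SELECT row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
   FROM   test
   ) sub
WHERE  rn > 1;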
I ran each query several times with EXPLAIN ANALYZE and took the best result. As a rule, the best run hardly differs from the first or the worst.
A bare SELECT (without the DELETE) shows similar results.
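Note that EXPLAIN ANALYZE actually executes the DELETE. To time repeated runs against identical data, you can wrap each run in a transaction and roll it back - a minimal sketch of that technique (my assumption about the procedure, not stated above):

BEGIN;
EXPLAIN ANALYZE
DELETE FROM test ...;  -- one of the four queries below
ROLLBACK;              -- undo the delete so the next run sees the same rows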
1. CTE with rank()
Total runtime: 150.411 ms
Total runtime: 149.853 ms - after ANALYZE
WITH x AS (
   SELECT a
         ,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
   FROM   test
   )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    x.rk > 1;
2. CTE with row_number()
Total runtime: 148.240 ms
Total runtime: 147.711 ms - after ANALYZE
WITH x AS (
   SELECT a
         ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
   FROM   test
   )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    x.rn > 1;
3. row_number() in a subquery
Total runtime: 134.753 ms
Total runtime: 134.298 ms - after ANALYZE
DELETE FROM test
USING (
   SELECT a
         ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
   FROM   test
   ) x
WHERE  x.a = test.a
AND    x.rn > 1;
4. EXISTS semi-join
Total runtime: 143.777 ms
Total runtime: 69.072 ms - after ANALYZE
DELETE FROM test t
WHERE  EXISTS (
   SELECT 1
   FROM   test t1
   WHERE  t1.a < t.a
   AND   (t1.b, t1.c) = (t.b, t.c)
   );
The big difference in the second run comes from the planner switching to a Hash Semi Join instead of an extra Sort plus a Merge Semi Join.
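You can inspect the chosen plan without executing the statement by using plain EXPLAIN (a quick sketch; the join node names above are what to look for in its output):

EXPLAIN
DELETE FROM test t
WHERE  EXISTS (
   SELECT 1
   FROM   test t1
   WHERE  t1.a < t.a
   AND   (t1.b, t1.c) = (t.b, t.c)
   );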
Results
- EXISTS clearly wins with up-to-date table statistics.
- With stale statistics, row_number() in a subquery is the fastest; rank() is the slowest option.
- The CTE is slower than the subquery.
- ANALYZE (up-to-date statistics) helps performance, and can help a lot. Autovacuum (by default) should more or less take care of this automatically - with the exception of temporary tables or immediately after major changes to the table. More details here or here.
Test with 100,000 rows
I repeated the test with 100,000 rows and 63045 duplicates. Similar results, except that EXISTS was slower, even after ANALYZE.
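For reference, a sketch of the 100k-row setup, assuming the same schema as above with only the bounds changed. The random() range used in this round isn't stated; 200 is purely my guess, chosen because it yields roughly the reported number of duplicates:

INSERT INTO test (b, c)
SELECT (random() * 200)::int AS b  -- range is an assumption, see above
      ,(random() * 200)::int AS c
FROM   generate_series(1, 100000);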
1. Total runtime: 1648.601 ms
2. Total runtime: 1623.759 ms
3. Total runtime: 1568.893 ms
4. Total runtime: 1692.249 ms
Raising the statistics target to 1000, and then to the maximum of 10000 (overkill in real life), plus another ANALYZE, sped all queries up by ~ 1%, but the query planner still went with Sort + Merge Semi Join for EXISTS.
ALTER TABLE test ALTER COLUMN b SET STATISTICS 10000;
ALTER TABLE test ALTER COLUMN c SET STATISTICS 10000;
ANALYZE test;
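If you want to verify that the per-column target took effect, it is stored in pg_attribute (a small sketch; -1 means the column falls back to default_statistics_target):

SELECT attname, attstattarget
FROM   pg_attribute
WHERE  attrelid = 'test'::regclass
AND    attname IN ('b', 'c');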
Only after forcing the planner to avoid merge joins did it use a Hash Semi Join, cutting the time in half:
SET enable_mergejoin = off;
- Total runtime: 850.615 ms
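Keep in mind that enable_mergejoin is a planner setting for the whole session, not just one statement, so for a one-off experiment you would typically restore it afterwards (sketch):

RESET enable_mergejoin;  -- back to the default (on)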
Update
The query planner has been improved since then. A re-test with PostgreSQL 9.1.7 went straight to a Hash Semi Join.