Clearing SQL data to a unique constraint

I want to clear some data in a table before inserting a unique restriction on two columns.

CREATE TABLE test ( a integer NOT NULL, b integer NOT NULL, c integer NOT NULL, CONSTRAINT a_pk PRIMARY KEY (a) ); INSERT INTO test (a,b,c) VALUES (1,2,3) ,(2,2,3) ,(3,4,3) ,(4,4,4) ,(5,4,5) ,(6,4,4) ,(7,4,4); -- SELECT a FROM test WHERE ???? 

The output should be 2,6,7

I am looking for all lines after the first that are duplicated b,c

Example:

  • Lines 1,2 have (2,3) as b, c Line 1 is ok because it is first, 2 is not.

  • Rows 4,6,7 have (4,4) as b, c Row 4 is ok because it is first, 6,7 is not.

I then:

 DELETE FROM test WHERE a = those IDs; 

.. and add a unique constraint.

I thought about crossing the test myself, but not sure where to go next.

+4
source share
4 answers

I did a couple of tests. The EXISTS option is much faster - as I expected, and contrary to what @Tometzky posted .

Test bench with 10,000 rows on PostgreSQL 9.1.2 with decent settings:

 CREATE TEMP TABLE test ( a serial ,b int NOT NULL ,c int NOT NULL ); INSERT INTO test (b,c) SELECT (random()* 100)::int AS b, (random()* 100)::int AS c FROM generate_series(1, 10000); ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a); 

Between the first and second rounds of the tests, I ran:

 ANALYZE test; 

When I finally applied DELETE, 3368 duplicates were deleted. Performance may vary if you have significantly more or fewer duplicates.

I ran each request several times with EXPLAIN ANALYZE and accepted the best result. As a rule, the best is hardly different from the first or worst.
A naked SELECT (without DELETE ) shows similar results.

1. CTE with rank()

Total Run Time: 150.411 ms
Total run time: 149.853 ms - after ANALYZE

 WITH x AS ( SELECT a ,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk FROM test ) DELETE FROM test USING x WHERE xa = test.a AND rk > 1; 

2. CTE with row_number()

Total run time: 148.240 ms
Total lead time: 147.711 ms - after ANALYZE

 WITH x AS ( SELECT a ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn FROM test ) DELETE FROM test USING x WHERE xa = test.a AND rn > 1; 

3. row_number() in a subquery

Total Run Time: 134.753 ms
Total lead time: 134.298 ms - after ANALYZE

 DELETE FROM test USING ( SELECT a ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn FROM test ) x WHERE xa = test.a AND rn > 1; 

4. EXISTS half connection

Total Run Time: 143.777 ms
Total lead time : 69.072 ms - after ANALYZE

 DELETE FROM test t WHERE EXISTS ( SELECT 1 FROM test t1 WHERE t1.a < ta AND (t1.b, t1.c) = (tb, tc) ); 

The difference in the second run comes from switching to Hash Semi Join instead of additional sorting + merge merge.

results

  • EXISTS clearly wins with up-tp-date table statistics.
  • With legacy statistics row_number() in the subquery is the fastest.
  • rank() is the slowest option.
  • CTE is slower than a subquery.
  • ANALYZE (updated statistics) helps performance and can help a lot. Autovacuum (by default) should more or less take care of this automatically - with the exception of temporary tables or immediately after major changes to the table. More details here or here .

100,000 line test

I repeated the test with 100,000 rows and 63045 duplicates. Similar results, except that EXISTS was slower even after ANALYZE .

  • Total run time: 1648.601 ms
  • Total run time: 1623.759 ms
  • Total lead time: 1568.893 ms
  • Total run time: 1692.249 ms

Raising the target statistics to 1000, and then to a maximum of 10,000 (real-time overkill), and another ANALYZE accelerated all requests by ~ 1%, but the query planner still went with Sort + Merge Semi Join for EXISTS .

 ALTER TABLE test ALTER COLUMN b SET STATISTICS 10000; ALTER TABLE test ALTER COLUMN c SET STATISTICS 10000; ANALYZE test; 

Only after I made the scheduler avoid the merge merge did the scheduler use the Hash Semi Join, getting half the time:

 SET enable_mergejoin = off 
  1. Total run time: 850.615 ms

Update

Since then, query planners have been improved. Went straight to Hash Semi. Join re-testing with PostgreSQL 9.1.7.

+5
source

Using functions should be much faster than this answer :

 select a from ( select a, rank() over (partition by b, c order by a) as rank from test ) as _ where rank>1; 
+3
source
 select oa from test o where exists ( select 'x' from test i where ic = oc and ib = ob and ia < oa ); 

Thanks to a colleague!

+2
source

I would try:

 delete from test where a not in ( select min(a) from test group by b,c) 

From 20 to 60 ms on my machine, for what it costs, and the analysis does not affect the plan.

  Delete on test (cost=237.50..412.50 rows=5000 width=6) -> Seq Scan on test (cost=237.50..412.50 rows=5000 width=6) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> HashAggregate (cost=225.00..235.00 rows=1000 width=12) -> Seq Scan on test (cost=0.00..150.00 rows=10000 width=12) 
0
source

Source: https://habr.com/ru/post/1415402/


All Articles