SQL Query - delete duplicates if more than three duplicates?

Does anyone have an elegant SQL statement to remove duplicate records from a table, but only if the number of duplicates is greater than x? That is, it should allow up to 2 or 3 duplicates, but no more.

I currently have a delete statement that does the following:

    delete table
    from table t
    left outer join (
        select max(id) as rowid, dupcol1, dupcol2
        from table
        group by dupcol1, dupcol2
    ) as keeprows on t.id = keeprows.rowid
    where keeprows.rowid is null

This works great. But now I would like to delete these rows only if there are more than two duplicates.

thanks

+6
sql sql-server duplicates
5 answers
    with cte as (
        select row_number() over (partition by dupcol1, dupcol2 order by ID) as rn
        from table
    )
    delete from cte
    where rn > 2; -- or > 3, etc.

The query produces a row number for each record, partitioned by (dupcol1, dupcol2) and ordered by ID. In effect, the row number counts the "duplicates" that share the same dupcol1 and dupcol2, assigning them the numbers 1, 2, 3, ..., N in order of ID. If you want to keep only 2 duplicates, you need to delete the rows that were assigned the numbers 3, 4, ..., N, and that is the part DELETE ... WHERE rn > 2 takes care of.

With this method you can change the ORDER BY to suit your preferred order (for example, ORDER BY ID DESC), so that the latest row gets rn = 1, the next latest gets rn = 2, and so on. The rest stays the same: the DELETE then removes only the oldest rows, since they have the highest row numbers.
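
The row-numbering approach can be tried out end to end with a minimal sketch. It rests on several assumptions: it runs on SQLite (3.25 or later, for window functions) through Python's sqlite3 module instead of SQL Server, the table name items and the helper names are hypothetical, and because SQLite cannot delete through a CTE, the row numbers are computed in a subquery and matched back by id. The keep_newest flag switches to ORDER BY id DESC.

```python
import sqlite3


def delete_excess_duplicates(conn, keep=2, keep_newest=False):
    """Keep at most `keep` rows per (dupcol1, dupcol2) group and delete the rest."""
    # ORDER BY id ASC keeps the oldest rows; ORDER BY id DESC keeps the newest.
    direction = "DESC" if keep_newest else "ASC"
    conn.execute(
        f"""
        DELETE FROM items
        WHERE id IN (
            SELECT id
            FROM (
                SELECT id,
                       ROW_NUMBER() OVER (
                           PARTITION BY dupcol1, dupcol2
                           ORDER BY id {direction}
                       ) AS rn
                FROM items
            )
            WHERE rn > ?
        )
        """,
        (keep,),
    )


def build_items():
    """Hypothetical sample data: 4x (a, x), 2x (b, y), 1x (c, z), ids 1..7."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, dupcol1 TEXT, dupcol2 TEXT)"
    )
    rows = [("a", "x")] * 4 + [("b", "y")] * 2 + [("c", "z")]
    conn.executemany("INSERT INTO items (dupcol1, dupcol2) VALUES (?, ?)", rows)
    return conn


def remaining_ids(conn):
    """Return the ids still present in the table, in ascending order."""
    return [row[0] for row in conn.execute("SELECT id FROM items ORDER BY id")]
```

Calling delete_excess_duplicates(build_items(), keep=2) trims only the (a, x) group, which is the only one with more than two rows; the other groups are left untouched.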

Unlike in this closely related question, as the conditions become more complex, using a CTE with row_number() becomes the simpler option. Performance can be a problem if there is no suitable index.

+7

HAVING is your friend

    select id, count(*) cnt from table group by id having count(*) > 2
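
As a rough illustration of what the HAVING filter returns, the sketch below lists the groups that exceed the limit. It is only a sketch under stated assumptions: it uses SQLite through Python's sqlite3 module, groups by the question's dupcol1 and dupcol2 columns instead of id, and the table name items and the helper names are hypothetical. Note that it only finds the over-duplicated groups; it does not delete anything.

```python
import sqlite3


def groups_over_limit(conn, limit=2):
    """Return (dupcol1, dupcol2, count) for every group with more than `limit` rows."""
    return conn.execute(
        """
        SELECT dupcol1, dupcol2, COUNT(*) AS cnt
        FROM items
        GROUP BY dupcol1, dupcol2
        HAVING COUNT(*) > ?
        ORDER BY dupcol1, dupcol2
        """,
        (limit,),
    ).fetchall()


def build_sample():
    """Hypothetical sample data: 4x (a, x), 2x (b, y), 1x (c, z)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, dupcol1 TEXT, dupcol2 TEXT)"
    )
    rows = [("a", "x")] * 4 + [("b", "y")] * 2 + [("c", "z")]
    conn.executemany("INSERT INTO items (dupcol1, dupcol2) VALUES (?, ?)", rows)
    return conn
```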

+3

You can try the following query:

    DELETE FROM table t1
    WHERE rowid IN (
        SELECT MIN(rowid)
        FROM table t2
        GROUP BY t2.id, t2.name
        HAVING COUNT(rowid) > 3
    );

+1

Pretty late, but a simple solution could be this. Suppose we have a table emp_dept (empid, deptid) that contains duplicate rows. Here I use @Count as a variable holding the number of duplicates allowed; for example, if 2 duplicates are allowed, then @Count = 2. In an Oracle database:

    delete from emp_dept
    where @Count <= (
        select count(1)
        from emp_dept i
        where i.empid = emp_dept.empid
          and i.deptid = emp_dept.deptid
          and i.rowid < emp_dept.rowid
    )

On SQL Server, or on any database that has no built-in row identifier such as rowid, we need to add an identity column to identify each row. Let's say we add nid as an identity column to the table:

 alter table emp_dept add nid int identity(1,1) -- to add identity column 

Now the query to delete the duplicates can be written as:

    delete from emp_dept
    where @Count <= (
        select count(1)
        from emp_dept i
        where i.empid = emp_dept.empid
          and i.deptid = emp_dept.deptid
          and i.nid < emp_dept.nid
    )

The idea is to delete every row for which at least n other rows exist with the same key values but a smaller rowid or identity value. So if duplicate rows exist, the ones with the higher rowid or identity values are deleted. A row without duplicates has no matching row with a lower rowid or identity value, so it is not deleted.
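
The counting idea can be checked with a small sketch. The assumptions here: it runs on SQLite through Python's sqlite3 module, it relies on SQLite's implicit rowid in place of Oracle's rowid or an added identity column, it passes the allowed count as a bound parameter instead of the @Count variable, and the helper names and sample rows are hypothetical.

```python
import sqlite3


def delete_beyond_count(conn, count=2):
    """Delete each row that has at least `count` earlier duplicates (lower rowid)."""
    conn.execute(
        """
        DELETE FROM emp_dept
        WHERE ? <= (
            SELECT COUNT(1)
            FROM emp_dept AS i
            WHERE i.empid = emp_dept.empid
              AND i.deptid = emp_dept.deptid
              AND i.rowid < emp_dept.rowid
        )
        """,
        (count,),
    )


def build_emp_dept():
    """Hypothetical sample data: 4x (1, 10), 2x (2, 20), 1x (3, 30)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp_dept (empid INTEGER, deptid INTEGER)")
    rows = [(1, 10)] * 4 + [(2, 20)] * 2 + [(3, 30)]
    conn.executemany("INSERT INTO emp_dept (empid, deptid) VALUES (?, ?)", rows)
    return conn


def group_counts(conn):
    """Return (empid, deptid, count) per group, ordered by empid and deptid."""
    return conn.execute(
        """
        SELECT empid, deptid, COUNT(*)
        FROM emp_dept
        GROUP BY empid, deptid
        ORDER BY empid, deptid
        """
    ).fetchall()
```

Because the first `count` rows of each group never have enough earlier duplicates to qualify, they always survive, whatever order the database deletes the rows in.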

0

For Oracle:

    delete from test
    where rowid = ANY (
        select min(test.rowid)
        from test
        left outer join (
            select min(rowid) row_id
            from test
            group by id, name
        ) t on test.rowid = t.row_id
        where t.row_id is null
        group by test.id, test.name
    );

0
