The fastest way to delete duplicate data

Question


After searching stackoverflow.com, I found a few questions on how to remove duplicates, but none of them addressed speed.

In my case, I have a table with 10 columns that contains 5 million exact duplicate rows. In addition, I have at least a million other rows that are duplicates in 9 of the 10 columns. My current technique has taken 3 hours so far to delete these 5 million rows. Here is my process:

    -- Step 1: **This step took 13 minutes.**
    -- Insert only one of the n duplicate rows into a temp table
    select
        MAX(prikey) as MaxPriKey, -- identity(1, 1)
        a, b, c, d, e, f, g, h, i
    into #dupTemp
    FROM sourceTable
    group by a, b, c, d, e, f, g, h, i
    having COUNT(*) > 1

Next,

    -- Step 2: **This step is taking the 3+ hours**
    -- delete the row when all the non-unique columns are the same (duplicates) and
    -- have a smaller prikey not equal to the max prikey
    delete from sourceTable
    from sourceTable
    inner join #dupTemp on
        sourceTable.a = #dupTemp.a and
        sourceTable.b = #dupTemp.b and
        sourceTable.c = #dupTemp.c and
        sourceTable.d = #dupTemp.d and
        sourceTable.e = #dupTemp.e and
        sourceTable.f = #dupTemp.f and
        sourceTable.g = #dupTemp.g and
        sourceTable.h = #dupTemp.h and
        sourceTable.i = #dupTemp.i and
        sourceTable.PriKey != #dupTemp.MaxPriKey

Any tips on how to speed this up, or a faster way? Keep in mind that I will have to run this again for the rows that are not exact duplicates.

Many thanks.

UPDATE:
I had to stop step 2 at the 9-hour mark. I tried the OMG Ponies method and it finished in only 40 minutes. I also tried my step 2 with Andomar's batched delete; it ran for 9 hours before I stopped it.

UPDATE:
I ran a similar query with one fewer column to get rid of another set of duplicates, and it finished in just 4 minutes (8000 rows) using the OMG Ponies method.

I will try the CTE method the next chance I get; however, I suspect the OMG Ponies method will be hard to beat.

+6

sql sql-server sql-server-2008 etl

Oo Aug 17 '10 at 21:56


6 answers

Can you allow the source table to be unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. It is basically the approach you already use with the temp table, except that you create a “regular” table instead.

Then drop the original table and rename the staging table to the old table's name.
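
A minimal sketch of that copy-and-swap, assuming the column names from the question (prikey plus a through i). The staging-table name sourceTable_dedup is hypothetical, and the sketch keeps the highest prikey in each group, matching the question's step 1:

```sql
-- Sketch only: sourceTable_dedup is a hypothetical staging-table name.
-- Copy one row per group of identical a..i values, keeping the highest prikey.
SELECT MAX(prikey) AS prikey, a, b, c, d, e, f, g, h, i
INTO sourceTable_dedup
FROM sourceTable
GROUP BY a, b, c, d, e, f, g, h, i;

-- Swap the tables once the copy has been checked.
BEGIN TRANSACTION;
DROP TABLE sourceTable;
EXEC sp_rename 'sourceTable_dedup', 'sourceTable';
COMMIT TRANSACTION;
```

SELECT ... INTO creates only the columns and data, so any indexes, constraints, triggers, and permissions on the original table would have to be recreated on the new one after the swap.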

+4

a_horse_with_no_name Aug 17 '10 at 22:15


The bottleneck in a bulk row deletion is usually the transaction that SQL Server has to build up. You may be able to speed it up considerably by splitting the deletion into smaller transactions. For example, to delete 100 rows at a time:

    while 1=1
    begin
        delete top (100) from sourceTable ...
        if @@rowcount = 0 break
    end
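
As a fuller sketch of the same loop, the elided condition could be the duplicate test against the #dupTemp table from the question. The batch size of 10000 and the variable name @batchSize are assumptions to tune, not recommendations:

```sql
-- Sketch only: @batchSize is a hypothetical name and 10000 an arbitrary starting value.
DECLARE @batchSize INT = 10000;

WHILE 1 = 1
BEGIN
    -- Outside an explicit transaction, each DELETE commits on its own,
    -- so no single transaction has to cover every deleted row.
    DELETE TOP (@batchSize) FROM sourceTable
    WHERE EXISTS (SELECT NULL
                  FROM #dupTemp dt
                  WHERE sourceTable.a = dt.a AND sourceTable.b = dt.b
                    AND sourceTable.c = dt.c AND sourceTable.d = dt.d
                    AND sourceTable.e = dt.e AND sourceTable.f = dt.f
                    AND sourceTable.g = dt.g AND sourceTable.h = dt.h
                    AND sourceTable.i = dt.i
                    AND sourceTable.prikey < dt.MaxPriKey);

    -- Stop once a pass finds nothing left to delete.
    IF @@ROWCOUNT = 0 BREAK;
END
```

Every pass has to find its rows again, so batching pays off only when that lookup is cheap; an index that supports the join columns is likely to matter more than the batch size.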

+3

Andomar Aug 17 '10 at 22:10


Based on OMG Ponies' comment above, here is a CTE method that is a bit more compact. It works wonders on tables where (for whatever reason) you have no primary key and rows can be identical in every column.

    ;WITH cte AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
        FROM sourceTable
    )
    DELETE FROM cte
    WHERE sequence > 1

+1

Will a Aug 17 '10 at 22:23


Well, there are a lot of different things to try. First, would something like this work? (Run it as a SELECT first to make sure, and maybe even put the result into a temp table of its own, #recordsToDelete.)

    -- Caution: this deletes every row whose PriKey is not a MaxPriKey in #dupTemp,
    -- so #dupTemp must hold the max prikey of every group (built without the HAVING clause).
    delete sourceTable
    from sourceTable
    left join #dupTemp on sourceTable.PriKey = #dupTemp.MaxPriKey
    where #dupTemp.MaxPriKey is null

Then you can index the temp tables; put the index on prikey.

If you have a temp table holding the records you want to delete, you can delete them in batches, which is often faster than locking the entire table with a single delete.
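
One way this could look, as a sketch under assumptions: #recordsToDelete is filled with the keys of every row except the highest prikey in its group, the index name IX_recordsToDelete and the batch size are made up for illustration, and prikey is assumed to be unique in sourceTable.

```sql
-- Sketch only: IX_recordsToDelete and the batch size of 10000 are hypothetical.
-- Collect the keys to delete: every row except the highest prikey in its group.
SELECT prikey
INTO #recordsToDelete
FROM (
    SELECT prikey,
           ROW_NUMBER() OVER (PARTITION BY a, b, c, d, e, f, g, h, i
                              ORDER BY prikey DESC) AS rn
    FROM sourceTable
) AS ranked
WHERE rn > 1;

-- Index the temp table on the key so each batch is a cheap key lookup.
CREATE UNIQUE CLUSTERED INDEX IX_recordsToDelete ON #recordsToDelete (prikey);

WHILE 1 = 1
BEGIN
    -- Delete in batches by joining on the key only.
    DELETE TOP (10000) s
    FROM sourceTable AS s
    INNER JOIN #recordsToDelete AS r ON r.prikey = s.prikey;

    IF @@ROWCOUNT = 0 BREAK;
END
```

Running the inner SELECT on its own first shows exactly which rows would go before anything is deleted.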

0

Hlgem Aug 17 '10 at 22:04


Here is a way to combine both steps into a single step.

    WITH cte AS (
        SELECT prikey,
               ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
        FROM sourceTable
    )
    DELETE FROM sourceTable
    WHERE prikey IN (
        SELECT prikey
        FROM cte
        WHERE sequence > 1
    );

By the way, do you have any indexes that can be temporarily removed?

0

bobs Aug 17 '10 at 22:16


OMG Ponies · Accepted Answer · 2010-08-17T22:01:59+0000

How about EXISTS:

    DELETE FROM sourceTable
    WHERE EXISTS(SELECT NULL
                 FROM #dupTemp dt
                 WHERE sourceTable.a = dt.a
                   AND sourceTable.b = dt.b
                   AND sourceTable.c = dt.c
                   AND sourceTable.d = dt.d
                   AND sourceTable.e = dt.e
                   AND sourceTable.f = dt.f
                   AND sourceTable.g = dt.g
                   AND sourceTable.h = dt.h
                   AND sourceTable.i = dt.i
                   AND sourceTable.PriKey < dt.MaxPriKey)
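
After the delete, a quick check along these lines can confirm that nothing was missed. It is only a sketch that reuses the grouping from the question's step 1; the alias copies is made up for illustration:

```sql
-- Any row returned is a group of a..i values that still has more than one copy.
SELECT a, b, c, d, e, f, g, h, i, COUNT(*) AS copies
FROM sourceTable
GROUP BY a, b, c, d, e, f, g, h, i
HAVING COUNT(*) > 1;
```

If any of the columns a through i allows NULL, this check can still return rows: GROUP BY puts NULLs in the same group, while the = comparisons in the DELETE never match NULL to NULL, so those duplicates would survive.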
