After searching stackoverflow.com, I found a few questions on how to remove duplicates, but none of them addressed speed.
In my case, I have a 10-column table containing 5 million rows that are exact duplicates. In addition, I have at least a million other rows that are duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete those 5 million rows. Here is my process:
Step 1: **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table:

```sql
-- Insert only one of the n duplicate rows into a temp table
select MAX(prikey) as MaxPriKey, -- identity(1, 1)
       a, b, c, d, e, f, g, h, i
into #dupTemp
FROM sourceTable
group by a, b, c, d, e, f, g, h, i
having COUNT(*) > 1
```
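Note: since SELECT ... INTO creates a heap, #dupTemp has no indexes at this point. I'm assuming that indexing it on the nine join columns might help the Step 2 join, something like the sketch below, but I haven't tried it:

```sql
-- Hypothetical: index #dupTemp on the nine grouping columns so the
-- Step 2 join can seek instead of scan. Untested on my data.
create clustered index IX_dupTemp
    on #dupTemp (a, b, c, d, e, f, g, h, i);
```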
Step 2: **This step is taking the 3+ hours.** Delete each row where all the non-unique columns match a row in #dupTemp and the prikey is smaller than (not equal to) the max prikey:

```sql
-- Delete the row when all the non-unique columns are the same (duplicates)
-- and it has a smaller prikey, not equal to the max prikey
delete from sourceTable
from sourceTable
inner join #dupTemp on
    sourceTable.a = #dupTemp.a
    and sourceTable.b = #dupTemp.b
    and sourceTable.c = #dupTemp.c
    and sourceTable.d = #dupTemp.d
    and sourceTable.e = #dupTemp.e
    and sourceTable.f = #dupTemp.f
    and sourceTable.g = #dupTemp.g
    and sourceTable.h = #dupTemp.h
    and sourceTable.i = #dupTemp.i
where sourceTable.prikey <> #dupTemp.MaxPriKey
```
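In case the single big delete is the problem (transaction log growth), one variation I'm considering is deleting the duplicates in batches; a rough sketch, with an arbitrary, untuned batch size:

```sql
-- Sketch: delete the duplicates in chunks so each transaction stays small.
-- The batch size of 100000 is a guess, not something I've tuned.
declare @rows int;
set @rows = 1;
while @rows > 0
begin
    delete top (100000) st
    from sourceTable st
    inner join #dupTemp d on
        st.a = d.a and st.b = d.b and st.c = d.c
        and st.d = d.d and st.e = d.e and st.f = d.f
        and st.g = d.g and st.h = d.h and st.i = d.i
    where st.prikey <> d.MaxPriKey;

    set @rows = @@ROWCOUNT;
end
```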
Any tips on how to speed this up, or a faster way to do it? Keep in mind that I will have to run this again for the rows that are not exact duplicates.
Many thanks.
UPDATE: I had to stop my Step 2 at the 9-hour mark. I tried the OMG Ponies method instead, and it finished in only 40 minutes.

UPDATE 2: Ran a similar query with one fewer field to get rid of a different set of duplicates, and the query took only 4 minutes (8,000 rows) using the OMG Ponies method.
I will try the CTE method the next chance I get; however, I suspect the OMG Ponies method will be hard to beat.
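For reference, the CTE method I have in mind is the usual ROW_NUMBER() approach, roughly this (my untested sketch):

```sql
-- Sketch of the CTE method: number the rows in each duplicate group,
-- highest prikey first, then delete everything after the first row.
;with numbered as
(
    select row_number() over (partition by a, b, c, d, e, f, g, h, i
                              order by prikey desc) as rn
    from sourceTable
)
delete from numbered
where rn > 1;
```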