Delete duplicate rows

Question


I have a table with a unique non-clustered index that covers 4 columns. I want to update a large number of rows in the table. Once I do, those rows will no longer be distinct, so the update will fail because of the index.

I want to disable the index and then delete the oldest duplicate rows. Here is my query:

SELECT t.itemid, t.fieldid, t.version, updated
FROM dbo.VersionedFields w
INNER JOIN (
    SELECT itemid, fieldid, version, COUNT(*) AS QTY
    FROM dbo.VersionedFields
    GROUP BY itemid, fieldid, version
    HAVING COUNT(*) > 1
) t
    ON w.itemid = t.itemid
   AND w.fieldid = t.fieldid
   AND w.version = t.version

The SELECT inside the inner join returns the correct number of records that we want to delete, but it groups them, so the real number of rows involved is twice that.

After the join, all of the records are returned, but I only want to delete the oldest ones.

How can I do that?

+7

sql sql-server greatest-n-per-group

Luke wilkinson Aug 2 '11 at 17:30


4 answers

In SQL Server 2005 and later:

 WITH q AS (
     SELECT *,
            ROW_NUMBER() OVER (PARTITION BY itemid, fieldid, version
                               ORDER BY updated DESC) AS rn
     FROM versionedFields
 )
 DELETE FROM q
 WHERE rn > 1

+4

Quassnoi Aug 2 '11 at 17:34


Try something like:

 DELETE w
 FROM dbo.VersionedFields w
 WHERE w.version < (SELECT MAX(version) FROM dbo.VersionedFields)

Of course, you will want to restrict the MAX(version) subquery to only the versions of the field whose rows you want to delete.
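One way to restrict the subquery is to correlate it with the outer row. The sketch below assumes that "the field" is identified by the itemid and fieldid columns from the question; that pairing is a guess, not something the question states. Note that it removes every row with a lower version, and it does not remove rows that share the same version, so it is not a fix for exact duplicates.

```sql
-- Sketch only: assumes (itemid, fieldid) identifies one versioned field.
DELETE w
FROM dbo.VersionedFields AS w
WHERE w.version < (
    SELECT MAX(v.version)
    FROM dbo.VersionedFields AS v
    WHERE v.itemid = w.itemid      -- correlate on the same item
      AND v.fieldid = w.fieldid    -- and the same field
);
```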

0

Malfist Aug 2 '11 at 17:34


You probably want to look at this answer (deleting earlier duplicate rows).

In essence, that method uses grouping (or, optionally, a window function) to find the minimum id value in each group and delete that row. It may be more accurate to delete the rows whose id is <> MAX(row id).

So:

Drop the unique index
Load the data
Delete the duplicates using the grouping mechanism (ideally in a transaction, so that you can roll back if there is an error), then
Recreate the index.

Note that re-creating an index in a large table can take a long time.
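The steps above could be sketched roughly as follows. The index name IX_VersionedFields_Unique is hypothetical, and the column list in CREATE INDEX is an assumption: the question says the real index covers four columns but names only some of them. The sketch uses the window-function variant rather than grouping on an id, because the question does not show an id column.

```sql
-- 1. Drop the unique index (hypothetical name).
DROP INDEX IX_VersionedFields_Unique ON dbo.VersionedFields;

-- 2. Run the bulk UPDATE / data load that introduces the duplicates here.

-- 3. Delete the duplicates inside a transaction.
SET XACT_ABORT ON;   -- a run-time error rolls the whole transaction back
BEGIN TRANSACTION;

WITH q AS (
    SELECT itemid, fieldid, version, updated,
           ROW_NUMBER() OVER (PARTITION BY itemid, fieldid, version
                              ORDER BY updated DESC) AS rn
    FROM dbo.VersionedFields
)
DELETE FROM q
WHERE rn > 1;        -- keeps the most recently updated row of each group

COMMIT TRANSACTION;

-- 4. Recreate the index (assumed column list; use the real definition).
CREATE UNIQUE NONCLUSTERED INDEX IX_VersionedFields_Unique
    ON dbo.VersionedFields (itemid, fieldid, version);
```

If any duplicates remain, the final CREATE UNIQUE INDEX fails, which serves as a check that the cleanup was complete.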

0

rorycl Aug 2 '11 at 17:40


marc_s · Accepted Answer · 2011-08-02T17:38:55+0000

If you say SQL (Structured Query Language) but really mean SQL Server (Microsoft's relational database system), and if you are on SQL Server 2005 or later, you can use a CTE (Common Table Expression) for this purpose.

With this CTE, you can partition your data by some criteria, i.e. your ItemId (or a combination of columns), and have SQL Server number all of your rows starting at 1 for each of those partitions, ordered by some other criteria, i.e. probably version (or some other column).

So try something like this:

 ;WITH PartitionedData AS
 (
     SELECT itemid, fieldid, version,
            ROW_NUMBER() OVER(PARTITION BY ItemId ORDER BY version DESC) AS 'RowNum'
     FROM dbo.VersionedFields
 )
 DELETE FROM PartitionedData
 WHERE RowNum > 1

Basically, you partition your data by some criteria and number the rows in each partition, starting at 1 for each new partition and ordering by some other criteria (e.g. Date or Version).

So for each "partition" of data, the "newest" entry has RowNum = 1, and any others that belong to the same partition (because they have the same partition values) are numbered sequentially from 2 up to however many rows there are in that partition.

If you want to keep only the newest record, delete everything with RowNum greater than 1, and you're done!
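Before deleting anything, it may be worth previewing which rows would go. The sketch below swaps the DELETE for a SELECT and, as an assumption based on the question rather than on this answer, partitions by itemid, fieldid and version and orders by the updated column, so that the most recently updated row of each duplicate group gets RowNum = 1.

```sql
-- Preview only: lists the rows a DELETE ... WHERE RowNum > 1 would remove.
;WITH PartitionedData AS
(
    SELECT itemid, fieldid, version, updated,
           ROW_NUMBER() OVER (PARTITION BY itemid, fieldid, version
                              ORDER BY updated DESC) AS RowNum
    FROM dbo.VersionedFields
)
SELECT itemid, fieldid, version, updated, RowNum
FROM PartitionedData
WHERE RowNum > 1
ORDER BY itemid, fieldid, version, RowNum;
```

Once the output looks right, replacing the final SELECT with DELETE FROM PartitionedData WHERE RowNum > 1 removes exactly those rows.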
