You can join the same table in the fields that will be duplicated, and then anti-join the id field. Select the id field from the first table alias (tn1), and then use the array_agg function in the id field of the second table alias. Finally, for the array_agg function to work correctly, you will group the results in the tn1.id field. This will create a result set containing the record identifier and an array of all identifiers that match the connection conditions.
select tn1.id, array_agg(tn2.id) as duplicate_entries, from table_name tn1 join table_name tn2 on tn1.year = tn2.year and tn1.sid = tn2.sid and tn1.user_id = tn2.user_id and tn1.cid = tn2.cid and tn1.id <> tn2.id group by tn1.id;
Obviously, the id that will be in the duplicate_entries array for one id will also have its own records in the result set. You will have to use this result set to decide which identifier you want to become the source of "truth." One entry that should not be deleted. Perhaps you could do something like this:
with dupe_set as ( select tn1.id, array_agg(tn2.id) as duplicate_entries, from table_name tn1 join table_name tn2 on tn1.year = tn2.year and tn1.sid = tn2.sid and tn1.user_id = tn2.user_id and tn1.cid = tn2.cid and tn1.id <> tn2.id group by tn1.id order by tn1.id asc) select ds.id from dupe_set ds where not exists (select de from unnest(ds.duplicate_entries) as de where de < ds.id)
Selects the identifier of the smallest number that has duplicates (assuming the identifier increments int PK). This will be the identifier that you would save.
pwnyexpress Jan 26 '15 at 19:24 2015-01-26 19:24
source share