How to find duplicate records in PostgreSQL

I have a PostgreSQL database table called "user_links" that currently allows the following duplicate fields:

year, user_id, sid, cid 

The only restriction is currently the first field named "id", however now I want to add a restriction to make sure that year , user_id , sid and cid unique, but I cannot apply the restriction because duplicate values โ€‹โ€‹already exist that violate this restriction.

Is there a way to find all duplicates?

+73
sql duplicates postgresql
Jan 26 '15 at 18:55
source share
3 answers

The main idea will be to use a nested query with count aggregation:

 select * from yourTable ou where (select count(*) from yourTable inr where inr.sid = ou.sid) > 1 

You can customize the where clause in your internal query to narrow your search.




There is another good solution mentioned in the comments (but not everyone reads it):

 select Column1, Column2, count(*) from yourTable group by Column1, Column2 HAVING count(*) > 1 

Or shorter:

 SELECT (yourTable.*)::text, count(*) FROM yourTable GROUP BY yourTable.* HAVING count(*) > 1 
+113
Jan 26 '15 at 19:16
source share

From " Find duplicate rows with PostgreSQL " here is a smart solution:

 select * from ( SELECT id, ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row FROM tbl ) dups where dups.Row > 1 
+40
Aug 28 '15 at 7:26
source share

You can join the same table in the fields that will be duplicated, and then anti-join the id field. Select the id field from the first table alias (tn1), and then use the array_agg function in the id field of the second table alias. Finally, for the array_agg function to work correctly, you will group the results in the tn1.id field. This will create a result set containing the record identifier and an array of all identifiers that match the connection conditions.

 select tn1.id, array_agg(tn2.id) as duplicate_entries, from table_name tn1 join table_name tn2 on tn1.year = tn2.year and tn1.sid = tn2.sid and tn1.user_id = tn2.user_id and tn1.cid = tn2.cid and tn1.id <> tn2.id group by tn1.id; 

Obviously, the id that will be in the duplicate_entries array for one id will also have its own records in the result set. You will have to use this result set to decide which identifier you want to become the source of "truth." One entry that should not be deleted. Perhaps you could do something like this:

 with dupe_set as ( select tn1.id, array_agg(tn2.id) as duplicate_entries, from table_name tn1 join table_name tn2 on tn1.year = tn2.year and tn1.sid = tn2.sid and tn1.user_id = tn2.user_id and tn1.cid = tn2.cid and tn1.id <> tn2.id group by tn1.id order by tn1.id asc) select ds.id from dupe_set ds where not exists (select de from unnest(ds.duplicate_entries) as de where de < ds.id) 

Selects the identifier of the smallest number that has duplicates (assuming the identifier increments int PK). This will be the identifier that you would save.

+2
Jan 26 '15 at 19:24
source share



All Articles