Removing duplicate rows from Redshift

I am trying to delete some duplicate data in my Redshift table.

Below is my query:

 WITH duplicates AS (
     SELECT *,
            ROW_NUMBER() OVER (PARTITION BY record_indicator
                               ORDER BY record_indicator) AS duplicate
     FROM table_name
 )
 DELETE FROM duplicates
 WHERE duplicate > 1;

This query gives me an error:

Amazon Invalid operation: syntax error at or near "delete";

I'm not sure what the problem is, since the syntax of the statement seems to be correct. Has anyone encountered this situation before?

5 answers

Redshift being what it is (no enforced uniqueness for any column), Ziggy's third option is probably the best. Once you decide to go the temp-table route, it is more efficient to swap things out wholesale. Deletes and inserts are expensive in Redshift.

 begin;
 create table table_name_new as select distinct * from table_name;
 alter table table_name rename to table_name_old;
 alter table table_name_new rename to table_name;
 drop table table_name_old;
 commit;

If space is not a problem, you can keep the old table around for a while and use the other methods described here to verify that the row count in the original, accounting for duplicates, matches the row count in the new table.
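One such check might look like this (a sketch; it assumes the old table was kept under the name table_name_old from the swap above): the distinct row count of the old table should equal the total row count of the new one.

```sql
-- Rows in the new, deduplicated table
select count(*) from table_name;

-- Distinct rows in the old table; the two counts should match
select count(*) from (select distinct * from table_name_old) t;
```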

If you are continuously loading into such a table, you will want to pause that process while the swap happens.


If you are dealing with a lot of data, it is not always possible or sensible to recreate the entire table. It may be cheaper to find and delete just the affected rows:

 -- First identify all the rows that are duplicates
 CREATE TEMP TABLE duplicate_saleids AS
 SELECT saleid
 FROM sales
 WHERE saledateid BETWEEN 2224 AND 2231
 GROUP BY saleid
 HAVING COUNT(*) > 1;

 -- Extract one copy of all the duplicate rows
 CREATE TEMP TABLE new_sales(LIKE sales);

 INSERT INTO new_sales
 SELECT DISTINCT *
 FROM sales
 WHERE saledateid BETWEEN 2224 AND 2231
 AND saleid IN (
     SELECT saleid
     FROM duplicate_saleids
 );

 -- Remove all rows that were duplicated (all copies)
 DELETE FROM sales
 WHERE saledateid BETWEEN 2224 AND 2231
 AND saleid IN (
     SELECT saleid
     FROM duplicate_saleids
 );

 -- Insert back in the single copies
 INSERT INTO sales
 SELECT *
 FROM new_sales;

 -- Cleanup
 DROP TABLE duplicate_saleids;
 DROP TABLE new_sales;

 COMMIT;

Full article: https://elliot.land/post/removing-duplicate-data-in-redshift


The following deletes all rows in 'tablename' that have a duplicate; note that if the duplicated rows share the same id, every copy is removed, so it will not deduplicate the table:

 DELETE FROM tablename
 WHERE id IN (
     SELECT id
     FROM (
         SELECT id,
                ROW_NUMBER() OVER (PARTITION BY column1, column2, column3
                                   ORDER BY id) AS rnum
         FROM tablename
     ) t
     WHERE t.rnum > 1
 );



Your query does not work because Redshift does not allow DELETE after a WITH clause. Only SELECT and UPDATE and a few others are allowed (see the WITH clause documentation).

Solution (in my situation):

My events table has an id column that should uniquely identify each record but contained duplicate rows. This id column corresponds to your record_indicator.

Unfortunately, I was unable to create a deduplicated copy with SELECT DISTINCT because I ran into the following error:

ERROR: Intermediate result row exceeds database block size

But it worked like a charm:

 CREATE TABLE temp AS (
     SELECT *,
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
     FROM events
 );

This yields the temp table:

 id | rownumber | ...
 ---------------------
 1  | 1         | ...
 1  | 2         | ...
 2  | 1         | ...
 2  | 2         | ...

Now duplicates can be removed by deleting rows having rownumber greater than 1:

 DELETE FROM temp WHERE rownumber > 1 

After that, rename the tables and you're done.
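The rename step might look like this (a sketch; it assumes the table names used above, and drops the helper rownumber column first so the new table's schema matches the old one):

```sql
ALTER TABLE temp DROP COLUMN rownumber;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE temp RENAME TO events;
DROP TABLE events_old;
```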


If the table has an id column, either of the following should work:

 WITH duplicates AS (
     SELECT *,
            ROW_NUMBER() OVER (PARTITION BY record_indicator
                               ORDER BY record_indicator) AS duplicate
     FROM table_name
 )
 DELETE FROM table_name
 WHERE id IN (SELECT id FROM duplicates WHERE duplicate > 1);

or

 DELETE FROM table_name
 WHERE id IN (
     SELECT id
     FROM (
         SELECT id,
                ROW_NUMBER() OVER (PARTITION BY record_indicator
                                   ORDER BY record_indicator) AS duplicate
         FROM table_name
     ) x
     WHERE duplicate > 1
 );

If you do not have a primary key, you can do the following:

 BEGIN;

 CREATE TEMP TABLE mydups ON COMMIT DROP AS
 SELECT DISTINCT ON (record_indicator) *
 FROM table_name
 ORDER BY record_indicator --, other_optional_priority_field DESC
 ;

 DELETE FROM table_name
 WHERE record_indicator IN (
     SELECT record_indicator FROM mydups
 );

 INSERT INTO table_name
 SELECT * FROM mydups;

 COMMIT;
