Copy data from Amazon S3 to Redshift and avoid duplicate rows

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid loading the same files again. I don't have any particular constraints on my Redshift table. Is there a way to implement this using the COPY command?

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html

I tried to add a unique constraint and set the column as the primary key with no luck. Redshift does not seem to support unique / primary key constraints.


My solution is to run a DELETE command on the table before the COPY. In my use case, every time I need to copy a daily snapshot's records into the Redshift table, I can use the following DELETE command to clear that day's potentially duplicated rows, and then run the COPY command.

DELETE from t_data where snapshot_day = 'xxxx-xx-xx';
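For example, a minimal sketch of the whole daily load might look like the following; the t_data table, snapshot_day column, file name, S3 path, and credentials are placeholders from my setup, so adjust them to yours. Wrapping both statements in one transaction avoids a window where the day's rows are missing:

begin;

-- clear any rows already loaded for this snapshot day
DELETE from t_data where snapshot_day = 'xxxx-xx-xx';

-- then load the day's file from S3
copy t_data
from '[s3-path]/snapshot_xxxx-xx-xx.csv'
CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]'
delimiter ',';

end;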


As user1045047 mentioned, Amazon Redshift does not support unique constraints, so I was looking for a way to remove duplicate records from a table with a DELETE statement. Eventually I found a reasonable way.

Amazon Redshift supports creating an IDENTITY column, which stores an auto-generated unique number. http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

The following SQL for PostgreSQL removes duplicate records using OID, which is a unique column; you can use the same SQL on Redshift by replacing OID with the IDENTITY column.

DELETE FROM duplicated_table WHERE OID > (
  SELECT MIN(OID) FROM duplicated_table d2
  WHERE duplicated_table.column1 = d2.column1
  AND duplicated_table.column2 = d2.column2
);

Here is an example that I tested on my Amazon Redshift cluster.

create table auto_id_table (auto_id int IDENTITY, name varchar, age int);

insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Matt', 24);

select * from auto_id_table order by auto_id;

 auto_id | name | age
---------+------+-----
       1 | John |  18
       2 | John |  18
       3 | John |  18
       4 | John |  18
       5 | John |  18
       6 | Bob  |  20
       7 | Bob  |  20
       8 | Matt |  24
(8 rows)

delete from auto_id_table where auto_id > (
  select min(auto_id) from auto_id_table d
  where auto_id_table.name = d.name
  and auto_id_table.age = d.age
);

select * from auto_id_table order by auto_id;

 auto_id | name | age
---------+------+-----
       1 | John |  18
       6 | Bob  |  20
       8 | Matt |  24
(3 rows)

It also works with the COPY command, as follows.

  • auto_id_table.csv

     John,18
     Bob,20
     Matt,24
  • copy sql

     copy auto_id_table (name, age)
     from '[s3-path]/auto_id_table.csv'
     CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]'
     delimiter ',';

The advantage of this method is that you do not need to run any DDL statements. However, it does not work with existing tables that do not have an IDENTITY column, because an IDENTITY column cannot be added to an existing table. The only way to delete duplicate records from an existing table is to migrate all records, like this (the same approach as in user1045047's answer):

insert into temp_table (select distinct * from original_table);
drop table original_table;
alter table temp_table rename to original_table;
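Note that this assumes temp_table already exists with the same columns as original_table. If it does not, a sketch of the full migration (table names are placeholders) could create it first with CREATE TABLE ... LIKE:

-- create an empty table with the same structure as the original
create table temp_table (like original_table);

-- copy only the distinct rows, then swap the tables
insert into temp_table (select distinct * from original_table);
drop table original_table;
alter table temp_table rename to original_table;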

There is currently no direct way to remove duplicates in Redshift. Redshift does not support primary key / unique key constraints, and deleting duplicates using a row number is not an option either (i.e. deleting rows with a row number greater than 1), because the DELETE operation in Redshift does not allow complex statements (and the concept of a row number is not present in Redshift anyway).

The best way to remove duplicates is to write a cron / Quartz job that selects all the distinct rows, puts them into a separate table, and then renames that table to the original table name.

Insert into temp_originalTable (Select Distinct * from originalTable)

Drop table originalTable

Alter table temp_originalTable rename to originalTable


Mmm ..

The easy way: never load data directly into your main table.

Steps to avoid duplication:

  • start a transaction
  • bulk load into a staging table
  • delete from the main table the rows that match rows in the staging table
  • insert into the main table from the staging table (merge)
  • drop the staging table
  • end the transaction

It is also quite fast, and it is the approach recommended by the Redshift docs; a sketch of these steps is shown below.
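A sketch of those steps in SQL, assuming a hypothetical main_table, a staging table created from it, an id column that identifies duplicates, and placeholder S3 path and credentials:

begin;

-- 1. stage the incoming data in a temporary table shaped like the target
create temp table stage_main (like main_table);

copy stage_main
from '[s3-path]/new_rows.csv'
CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]'
delimiter ',';

-- 2. remove main-table rows that the staging table is about to re-insert
--    (the join condition is whatever identifies a duplicate in your data)
delete from main_table
using stage_main
where main_table.id = stage_main.id;

-- 3. merge the staged rows into the main table
insert into main_table
select * from stage_main;

-- 4. drop the staging table and finish the transaction
drop table stage_main;

end;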

