Copy data from Amazon S3 to Redshift and avoid duplicate rows

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid loading the same files again. I don't have any particular constraints on my Redshift table. Is there a way to implement this using the COPY command?

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html

I tried to add a unique constraint and set the column as the primary key with no luck. Redshift does not seem to support unique / primary key constraints.


My solution is to run a DELETE command on the table before the COPY. In my use case, every time I need to copy a daily snapshot's records into the Redshift table, I can use the following DELETE command to clear that day's potentially duplicated rows, and then run the COPY command.

DELETE from t_data where snapshot_day = 'xxxx-xx-xx';
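For example, a minimal sketch of the whole daily load might look like the following; the t_data table, snapshot_day column, file name, S3 path, and credentials are placeholders from my setup, so adjust them to yours. Wrapping both statements in one transaction avoids a window where the day's rows are missing:

begin;

-- clear any rows already loaded for this snapshot day
DELETE from t_data where snapshot_day = 'xxxx-xx-xx';

-- then load the day's file from S3
copy t_data
from '[s3-path]/snapshot_xxxx-xx-xx.csv'
CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]'
delimiter ',';

end;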


As user1045047 mentioned, Amazon Redshift does not support unique constraints, so I was looking for a way to remove duplicate records from a table with a DELETE statement. Eventually I found a reasonable way.

Amazon Redshift supports creating an IDENTITY column, which stores an auto-generated unique number. http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

The following SQL for PostgreSQL removes duplicate records using OID, which is a unique column; you can use the same SQL on Redshift by replacing OID with the IDENTITY column.

DELETE FROM duplicated_table WHERE OID > (
  SELECT MIN(OID) FROM duplicated_table d2
  WHERE duplicated_table.column1 = d2.column1
  AND duplicated_table.column2 = d2.column2
);

Here is an example that I tested on my Amazon Redshift cluster.

create table auto_id_table (auto_id int IDENTITY, name varchar, age int);

insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Matt', 24);

select * from auto_id_table order by auto_id;

 auto_id | name | age
---------+------+-----
       1 | John |  18
       2 | John |  18
       3 | John |  18
       4 | John |  18
       5 | John |  18
       6 | Bob  |  20
       7 | Bob  |  20
       8 | Matt |  24
(8 rows)

delete from auto_id_table where auto_id > (
  select min(auto_id) from auto_id_table d
  where auto_id_table.name = d.name
  and auto_id_table.age = d.age
);

select * from auto_id_table order by auto_id;

 auto_id | name | age
---------+------+-----
       1 | John |  18
       6 | Bob  |  20
       8 | Matt |  24
(3 rows)

It also works with the COPY command, as follows.

  • auto_id_table.csv

     John,18
     Bob,20
     Matt,24
  • copy sql

     copy auto_id_table (name, age)
     from '[s3-path]/auto_id_table.csv'
     CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]'
     delimiter ',';

The advantage of this method is that you do not need to run any DDL statements. However, it does not work with existing tables that do not have an IDENTITY column, because an IDENTITY column cannot be added to an existing table. The only way to delete duplicate records from an existing table is to migrate all records, like this (the same approach as in user1045047's answer):

insert into temp_table (select distinct * from original_table);
drop table original_table;
alter table temp_table rename to original_table;
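Note that this assumes temp_table already exists with the same columns as original_table. If it does not, a sketch of the full migration (table names are placeholders) could create it first with CREATE TABLE ... LIKE:

-- create an empty table with the same structure as the original
create table temp_table (like original_table);

-- copy only the distinct rows, then swap the tables
insert into temp_table (select distinct * from original_table);
drop table original_table;
alter table temp_table rename to original_table;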

There is currently no direct way to remove duplicates in Redshift. Redshift does not support primary key / unique key constraints, and deleting duplicates using a row number is not an option either (i.e. deleting rows with a row number greater than 1), because the DELETE operation in Redshift does not allow complex statements (and the concept of a row number is not present in Redshift anyway).

The best way to remove duplicates is to write a cron / Quartz job that selects all the distinct rows, puts them into a separate table, and then renames that table to the original table name.

Insert into temp_originalTable (Select Distinct * from originalTable)

Drop table originalTable

Alter table temp_originalTable rename to originalTable


Mmm ..

The easy way: never load data directly into your main table.

Steps to avoid duplication:

  • start a transaction
  • bulk load into a staging table
  • delete from the main table the rows that match rows in the staging table
  • insert into the main table from the staging table (merge)
  • drop the staging table
  • end the transaction

It is also quite fast, and it is the approach recommended by the Redshift docs; a sketch of these steps is shown below.
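A sketch of those steps in SQL, assuming a hypothetical main_table, a staging table created from it, an id column that identifies duplicates, and placeholder S3 path and credentials:

begin;

-- 1. stage the incoming data in a temporary table shaped like the target
create temp table stage_main (like main_table);

copy stage_main
from '[s3-path]/new_rows.csv'
CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]'
delimiter ',';

-- 2. remove main-table rows that the staging table is about to re-insert
--    (the join condition is whatever identifies a duplicate in your data)
delete from main_table
using stage_main
where main_table.id = stage_main.id;

-- 3. merge the staged rows into the main table
insert into main_table
select * from stage_main;

-- 4. drop the staging table and finish the transaction
drop table stage_main;

end;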

