I have an ETL process that incrementally builds up large tables in Redshift. It performs actions in the following order (a rough SQL sketch of one run follows the list):
- Starts a transaction
- Creates a table staging_foo with the same structure as foo
- Copies data from an external source into staging_foo
- Performs bulk INSERT / UPDATE / DELETE on foo so that it matches staging_foo
- Drops staging_foo
- Commits the transaction
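
Roughly, one run looks like the sketch below. The column names (id, value, deleted), the S3 path, and the IAM role are simplified placeholders, not the real ones:

```sql
BEGIN;

-- Empty staging table with the same structure as foo
CREATE TABLE staging_foo (LIKE foo);

-- Load the latest batch from the external source
COPY staging_foo
FROM 's3://example-bucket/foo/batch/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
CSV;

-- Apply deletes, updates, and inserts so foo matches staging_foo
DELETE FROM foo
USING staging_foo
WHERE foo.id = staging_foo.id
  AND staging_foo.deleted;

UPDATE foo
SET value = staging_foo.value
FROM staging_foo
WHERE foo.id = staging_foo.id
  AND NOT staging_foo.deleted;

INSERT INTO foo (id, value)
SELECT s.id, s.value
FROM staging_foo s
WHERE NOT s.deleted
  AND NOT EXISTS (SELECT 1 FROM foo f WHERE f.id = s.id);

DROP TABLE staging_foo;

COMMIT;
```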
Run on its own, this process works. However, to keep updates to foo flowing continuously and to provide redundancy if one instance fails, I run several instances of the process at the same time. When that happens, I sometimes get serializable isolation errors, because both processes apply the same changes to foo from staging_foo in overlapping transactions.
What happens is that the first process creates the staging_foo table, and the second process blocks when it tries to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds), I find that the second process is unblocked before that commit has fully completed. So it appears the second process takes its snapshot of foo before the commit lands, which causes its inserts / updates / deletes (some of which may be redundant) to fail.
My theory is based on the documentation at http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html, which says:
Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands:
ALTER TABLE (to add or drop columns)
CREATE TABLE
DROP TABLE
TRUNCATE TABLE
The documentation above is a little confusing, because it first says the snapshot is created at the beginning of the transaction, but then says the snapshot is created only on the first occurrence of certain DML / DDL operations.
I do not want to do a deep copy, where I replace foo wholesale instead of updating it incrementally. I have other processes constantly querying this table, so there is never a point at which I can swap it out without interruption. Another question asks something similar for the deep-copy case, but its answer will not work for me: How can I provide synchronous DDL operations on a table being replaced?
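
For reference, the deep-copy style replacement I am ruling out looks roughly like this (a sketch only; the rebuild step is simplified, and the drop/rename is exactly the swap my readers cannot tolerate):

```sql
BEGIN;

-- Rebuild a full replacement table (details assumed; in practice this would
-- merge the current foo with the staged changes)
CREATE TABLE foo_new (LIKE foo);
INSERT INTO foo_new SELECT * FROM foo;

-- Swap the new table in place of the old one
DROP TABLE foo;
ALTER TABLE foo_new RENAME TO foo;

COMMIT;
```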
Is there a way to structure my operations so that they avoid these serializable isolation errors? I need read access to foo to remain available, so I cannot LOCK the table.