I have an ETL process that incrementally builds up large tables in Redshift. It performs actions in the following order (a rough SQL sketch of one run follows the list):
- Starts a transaction
- Creates a table staging_foo with the same structure as foo
- Copies data from an external source into staging_foo
- Performs bulk INSERT / UPDATE / DELETE on foo so that it matches staging_foo
- Drops staging_foo
- Commits the transaction
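
Roughly, one run looks like the sketch below. The column names (id, value, deleted), the S3 path, and the IAM role are simplified placeholders, not the real ones:

```sql
BEGIN;

-- Empty staging table with the same structure as foo
CREATE TABLE staging_foo (LIKE foo);

-- Load the latest batch from the external source
COPY staging_foo
FROM 's3://example-bucket/foo/batch/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
CSV;

-- Apply deletes, updates, and inserts so foo matches staging_foo
DELETE FROM foo
USING staging_foo
WHERE foo.id = staging_foo.id
  AND staging_foo.deleted;

UPDATE foo
SET value = staging_foo.value
FROM staging_foo
WHERE foo.id = staging_foo.id
  AND NOT staging_foo.deleted;

INSERT INTO foo (id, value)
SELECT s.id, s.value
FROM staging_foo s
WHERE NOT s.deleted
  AND NOT EXISTS (SELECT 1 FROM foo f WHERE f.id = s.id);

DROP TABLE staging_foo;

COMMIT;
```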
Run on its own, this process works. However, to keep updates to foo flowing continuously and to provide redundancy if one instance fails, I run several instances of the process at the same time. When that happens, I sometimes get serializable isolation errors, because both processes apply the same changes to foo from staging_foo in overlapping transactions.
What happens is that the first process creates the staging_foo table, and the second process blocks when it tries to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds), I find that the second process is unblocked before that commit has fully completed. So it appears the second process takes its snapshot of foo before the commit lands, which causes its inserts / updates / deletes (some of which may be redundant) to fail.
My theory is based on the documentation at http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html, which says:
Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands:
ALTER TABLE (to add or drop columns)
CREATE TABLE
DROP TABLE
TRUNCATE TABLE
The documentation above is a little confusing, because it first says the snapshot is created at the beginning of the transaction, but then says the snapshot is created only on the first occurrence of certain DML / DDL operations.
I do not want to do a deep copy, where I replace foo wholesale instead of updating it incrementally. I have other processes constantly querying this table, so there is never a point at which I can swap it out without interruption. Another question asks something similar for the deep-copy case, but its answer will not work for me: How can I provide synchronous DDL operations on a table being replaced?
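
For reference, the deep-copy style replacement I am ruling out looks roughly like this (a sketch only; the rebuild step is simplified, and the drop/rename is exactly the swap my readers cannot tolerate):

```sql
BEGIN;

-- Rebuild a full replacement table (details assumed; in practice this would
-- merge the current foo with the staged changes)
CREATE TABLE foo_new (LIKE foo);
INSERT INTO foo_new SELECT * FROM foo;

-- Swap the new table in place of the old one
DROP TABLE foo;
ALTER TABLE foo_new RENAME TO foo;

COMMIT;
```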
Is there a way to structure my operations so that they avoid these serializable isolation errors? I need read access to foo to remain available, so I cannot LOCK the table.