Redshift: Serializable isolation violation on a table

I have a very large Redshift database containing billions of rows of HTTP request data.

I have a table called requests that has several important fields:

  • ip_address
  • city
  • state
  • country

I have a Python process that runs once a day, selects all the rows that have not yet been geocoded (i.e. have no city / state / country information), and then tries to geocode each IP address through the Google geocoding API.

This process (pseudo-code) is as follows:

 for ip_address in ips_to_geocode:
     country, state, city = geocode_ip_address(ip_address)
     execute_transaction('''
         UPDATE requests
         SET ip_country = %s, ip_state = %s, ip_city = %s
         WHERE ip_address = %s
     ''', (country, state, city, ip_address))

When I run this code, I often get the following errors:

 psycopg2.InternalError: 1023
 DETAIL: Serializable isolation violation on table - 108263,
 transactions forming the cycle are: 647671, 647682 (pid:23880)

I assume this is because I have other processes constantly inserting HTTP requests into the table, so when I try to execute my UPDATE statement, it conflicts with writes to the rows matching the IP addresses I want to update.

My question is this: how can I update these records in a reasonable way that stops producing these errors?

+9
sql amazon-redshift
4 answers

Your code is violating Redshift's serializable isolation level. You need to ensure that your code does not open a new transaction on a table while another transaction on that table is still open.

You can achieve this by locking the table in each transaction, so that no other transaction can update the table until the open transaction commits. I'm not sure how your code is structured (synchronous or asynchronous), but be aware that this will increase execution time, since each lock forces the others to wait until the transaction completes.

See: http://docs.aws.amazon.com/redshift/latest/dg/r_LOCK.html
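A minimal sketch of what this locking approach might look like in Python, assuming psycopg2 and the `requests` table from the question (`geocode_ip_address` stands in for the asker's real geocoder):

```python
def build_locked_update(ip_address, country, state, city):
    """Return the statements for one geocode update: an explicit table
    lock first, so concurrent writers wait instead of conflicting."""
    return [
        ("LOCK requests;", ()),
        ("UPDATE requests SET ip_country = %s, ip_state = %s, ip_city = %s "
         "WHERE ip_address = %s;",
         (country, state, city, ip_address)),
    ]

def run_locked_update(conn, ip_address, country, state, city):
    # One transaction per update; the lock is released at COMMIT.
    with conn:  # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            for sql, params in build_locked_update(
                    ip_address, country, state, city):
                cur.execute(sql, params)
```

Because the LOCK and the UPDATE run in the same transaction, the ingestion processes block until each update commits, which is exactly the concurrency cost described above.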

+5

The same problem just appeared with my code, and here is how I fixed it:

First of all, it is useful to know that this error code means you are attempting concurrent operations in Redshift. For example, if you issue a second query against a table before a query you ran a few minutes ago has finished, you may get this error (that was my case).

The good news: there is a simple way to serialize Redshift operations! You just need to use the LOCK command. Here is the Amazon documentation for the Redshift LOCK command. It works by making the next operation wait until the previous one has finished. Note that with this command your script will be a little slower.

In the end, the practical solution for me was to insert a LOCK command before the query statements (on the same line, separated by a ";"). More or less like this:

LOCK table_name; SELECT * from ...

And you should be good to go! I hope this helps you.

+3

Either start a new session when you perform the second update on the same table, or COMMIT after the first transaction completes.

You can also run set autocommit = on before starting the update.

0

Since you perform point updates during your geocoding process while other processes are writing to the table, you can occasionally hit a serializable isolation violation, depending on how and when the other processes write to the same table.

Suggestions

  • One way is to use table locking, as Marcus Vinicius Melo suggested in his answer.
  • Another approach is to catch the error and retry the transaction.

As a rule, the code that initiates any serializable transaction must be prepared to retry it when this error occurs. Since all transactions in Redshift are strictly serializable, all code that issues transactions against Redshift must be prepared to retry on this error.
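A minimal sketch of the retry approach. Detecting the conflict by matching the "1023" code in the exception message is an assumption here; in real code you would inspect the attributes of psycopg2's exception instead:

```python
import random
import time

SERIALIZATION_ERROR = "1023"  # Redshift's serializable isolation violation code

def run_with_retry(run_transaction, max_attempts=5):
    """Call a transaction function, retrying on serialization failures
    with a short randomized backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_transaction()
        except Exception as exc:
            if SERIALIZATION_ERROR not in str(exc) or attempt == max_attempts:
                raise  # not a serialization conflict, or out of attempts
            # Back off briefly so the competing transaction can finish.
            time.sleep(min(2 ** attempt, 30) * random.random() * 0.1)
```

The geocode loop from the question would then wrap each UPDATE in a small function and pass it to `run_with_retry`.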

Explanations

A typical cause of this error is that two transactions start and interleave their operations in such a way that at least one of them could not have completed as if the transactions had run one after another. The database system resolves this by aborting one of them with this error, which in essence hands control back to the code that initiated the transaction so it can take an appropriate course of action. Retrying is one such action.

One way to prevent such an inconsistent sequence of operations is to use a lock. But locking also limits concurrency in the many cases that would not have produced a conflicting sequence of operations. Locking guarantees the error will not occur, at the cost of concurrency. The retry approach gives concurrency a chance and handles the cases where a conflict does arise.

Recommendation

That said, I still recommend that you avoid doing point updates against Redshift like this at all. The geocoding process should write its results to a staging table and, once all records are processed, perform a single bulk update, followed by a VACUUM if needed.
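A sketch of the staging-table approach, assuming the table and column names from the question; `geocoded_staging` is a hypothetical staging table, and the exact column sizes are placeholders:

```python
BULK_UPDATE_STEPS = [
    # 1. The geocoder writes its results here instead of point-updating
    #    the requests table.
    """CREATE TEMP TABLE IF NOT EXISTS geocoded_staging (
           ip_address VARCHAR(45),
           ip_country VARCHAR(64),
           ip_state   VARCHAR(64),
           ip_city    VARCHAR(128)
       );""",
    # 2. One bulk UPDATE joins the staging table against requests, so
    #    there is a single write transaction instead of one per IP.
    """UPDATE requests
       SET ip_country = s.ip_country,
           ip_state   = s.ip_state,
           ip_city    = s.ip_city
       FROM geocoded_staging s
       WHERE requests.ip_address = s.ip_address;""",
    # 3. Reclaim space and re-sort after the large update. VACUUM must
    #    run outside an open transaction (e.g. with autocommit on).
    """VACUUM requests;""",
]
```

With this shape, the ingestion processes only ever contend with one short bulk write per day instead of thousands of point updates.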

0
