I have a very large Redshift database containing billions of rows of HTTP request data.
I have a table called requests that has several important fields:
ip_address, city, state, country
I have a Python process that runs once a day. It collects all the distinct rows that have not yet been geocoded (rows with no city / state / country information) and then tries to geocode each IP address through the Google geocoding API.
This process (pseudo-code) is as follows:
    for ip_address in ips_to_geocode:
        country, state, city = geocode_ip_address(ip_address)
        execute_transaction('''
            UPDATE requests
            SET ip_country = %s, ip_state = %s, ip_city = %s
            WHERE ip_address = %s
        ''', (country, state, city, ip_address))
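A minimal sketch of what the loop above might look like against a DB-API connection, in case a concrete version helps. The function name `geocode_pending` and the `placeholder` argument are hypothetical, introduced only for illustration; `geocode_ip_address` is passed in as a callable and is assumed to return a `(country, state, city)` tuple. Each UPDATE is committed as its own transaction, matching the pseudo-code.

```python
def geocode_pending(conn, ips_to_geocode, geocode_ip_address, placeholder="%s"):
    """Run one UPDATE per IP address, committing each as its own transaction.

    conn               -- any DB-API 2.0 connection (e.g. from psycopg2)
    ips_to_geocode     -- iterable of IP address strings
    geocode_ip_address -- callable returning (country, state, city); assumed
    placeholder        -- parameter marker of the driver ("%s" for psycopg2)
    """
    p = placeholder
    sql = (
        f"UPDATE requests SET ip_country = {p}, ip_state = {p}, ip_city = {p} "
        f"WHERE ip_address = {p}"
    )
    updated = 0
    cur = conn.cursor()
    for ip_address in ips_to_geocode:
        country, state, city = geocode_ip_address(ip_address)
        # Values are passed as parameters, never interpolated into the SQL.
        cur.execute(sql, (country, state, city, ip_address))
        updated += cur.rowcount  # rows touched by this UPDATE
        conn.commit()            # one transaction per IP address
    cur.close()
    return updated
```

Because every IP address gets its own transaction, each UPDATE can overlap independently with whatever else is writing to the table at that moment.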
When I run this code, I often get the following errors:
psycopg2.InternalError: 1023 DETAIL: Serializable isolation violation on table - 108263, transactions forming the cycle are: 647671, 647682 (pid:23880)
I assume this happens because other processes are constantly inserting HTTP requests into my table, so when I try to execute my UPDATE statement, it cannot select all the rows with the IP address that I want to update.
My question is this: what can I do to update these records in a reasonable way that will stop failing like this?
sql amazon-redshift
rdegges