Why does a missing primary key / unique key cause deadlocks on upsert?

I came across a schema and an upsert stored procedure that caused deadlocks. I have a general idea of why it deadlocks and how to fix it, and I can reproduce it, but I do not have a clear understanding of the sequence of steps that causes it. It would be great if someone could explain why this causes a deadlock.

Here is the schema and stored procedures. This code runs on PostgreSQL 9.2.2.

    CREATE TABLE counters (
        count_type INTEGER NOT NULL,
        count_id INTEGER NOT NULL,
        count INTEGER NOT NULL
    );

    CREATE TABLE primary_relation (
        id INTEGER PRIMARY KEY,
        a_counter INTEGER NOT NULL DEFAULT 0
    );

    INSERT INTO primary_relation
    SELECT i FROM generate_series(1,5) AS i;

    CREATE OR REPLACE FUNCTION increment_count(ctype integer, cid integer, i integer)
    RETURNS VOID AS $$
    BEGIN
        LOOP
            UPDATE counters SET count = count + i
            WHERE count_type = ctype AND count_id = cid;
            IF FOUND THEN
                RETURN;
            END IF;
            BEGIN
                INSERT INTO counters (count_type, count_id, count)
                VALUES (ctype, cid, i);
                RETURN;
            EXCEPTION WHEN OTHERS THEN
            END;
        END LOOP;
    END;
    $$ LANGUAGE PLPGSQL;

    CREATE OR REPLACE FUNCTION update_primary_a_count(ctype integer)
    RETURNS VOID AS $$
        WITH deleted_counts_cte AS (
            DELETE FROM counters
            WHERE count_type = ctype
            RETURNING *
        ), rollup_cte AS (
            SELECT count_id, SUM(count) AS count
            FROM deleted_counts_cte
            GROUP BY count_id
            HAVING SUM(count) <> 0
        )
        UPDATE primary_relation
        SET a_counter = a_counter + rollup_cte.count
        FROM rollup_cte
        WHERE primary_relation.id = rollup_cte.count_id
    $$ LANGUAGE SQL;
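One way to see how this upsert can produce duplicate rows is to simulate the snapshot behavior in plain Python (a hypothetical sketch, not psycopg2 or PostgreSQL code — `table`, `upsert`, and the snapshot lists are made up for illustration): two concurrent transactions each work against the snapshot they took at start, so neither sees the other's uncommitted INSERT.

```python
# Hypothetical sketch: under MVCC, each transaction sees only the rows
# visible in the snapshot it took at start, not the other's uncommitted
# INSERT, so both upserts decide to insert and duplicates appear.
table = []  # committed rows: (count_type, count_id, count)

def upsert(snapshot, ctype, cid, i):
    """The increment_count logic, run against a private snapshot."""
    matches = [row for row in snapshot
               if row[0] == ctype and row[1] == cid]
    if matches:
        # UPDATE path: bump every matching row
        return ("update", [(ctype, cid, c + i) for (_, _, c) in matches])
    # No row visible -> INSERT path
    return ("insert", [(ctype, cid, i)])

# Both transactions start before either commits: identical empty snapshots.
snap_a, snap_b = list(table), list(table)
act_a = upsert(snap_a, 0, 1, 1)
act_b = upsert(snap_b, 0, 1, 1)

# Both chose the INSERT path, so the same logical counter is stored twice.
table.extend(act_a[1])
table.extend(act_b[1])
print(table)  # [(0, 1, 1), (0, 1, 1)]
```

Without a unique constraint nothing stops the second INSERT, which is where the duplicate counter rows come from.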

And here is the Python script to reproduce the deadlock.

    import os
    import random
    import time

    import psycopg2

    COUNTERS = 5
    THREADS = 10
    ITERATIONS = 500

    def increment():
        outf = open('synctest.out.%d' % os.getpid(), 'w')
        conn = psycopg2.connect(database="test")
        cur = conn.cursor()
        for i in range(0, ITERATIONS):
            time.sleep(random.random())
            start = time.time()
            cur.execute("SELECT increment_count(0, %s, 1)", [random.randint(1, COUNTERS)])
            conn.commit()
            outf.write("%f\n" % (time.time() - start))
        conn.close()
        outf.close()

    def update(n):
        outf = open('synctest.update', 'w')
        conn = psycopg2.connect(database="test")
        cur = conn.cursor()
        for i in range(0, n):
            time.sleep(random.random())
            start = time.time()
            cur.execute("SELECT update_primary_a_count(0)")
            conn.commit()
            outf.write("%f\n" % (time.time() - start))
        conn.close()

    pids = []
    for i in range(THREADS):
        pid = os.fork()
        if pid != 0:
            print 'Process %d spawned' % pid
            pids.append(pid)
        else:
            print 'Starting child %d' % os.getpid()
            increment()
            print 'Exiting child %d' % os.getpid()
            os._exit(0)

    update(ITERATIONS)

    for pid in pids:
        print "waiting on %d" % pid
        os.waitpid(pid, 0)

    # cleanup
    update(1)

I understand that one problem is that the upsert can produce duplicate rows (when there are multiple concurrent writers), which likely leads to some kind of double counting. But why does this lead to a deadlock?

The error received from PostgreSQL looks something like this:

    process 91924 detected deadlock while waiting for ShareLock on transaction 4683083 after 100.559 ms
    SQL statement "UPDATE counters

And the client spews something like this:

    psycopg2.extensions.TransactionRollbackError: deadlock detected
    DETAIL:  Process 91924 waits for ShareLock on transaction 4683083; blocked by process 91933.
             Process 91933 waits for ShareLock on transaction 4683079; blocked by process 91924.
    HINT:  See server log for query details.
    CONTEXT:  SQL statement "UPDATE counters SET count = count + i WHERE count_type = ctype AND count_id = cid"
    PL/pgSQL function increment_count(integer,integer,integer) line 4 at SQL statement

To fix the problem, you need to add the primary key as follows:

 ALTER TABLE counters ADD PRIMARY KEY (count_type, count_id); 
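As a rough sketch of why the primary key changes the upsert's behavior (plain Python standing in for the PL/pgSQL loop; the dict key models the `(count_type, count_id)` constraint, and the raised exception models a unique violation from a concurrent insert — all names here are illustrative):

```python
# Sketch: with the PK, a second INSERT for the same key raises instead of
# creating a duplicate row, the EXCEPTION block swallows the error, and
# the LOOP retries, so the UPDATE finds exactly one existing row.
table = {}  # (count_type, count_id) -> count

def increment_count(ctype, cid, i):
    while True:                       # LOOP
        if (ctype, cid) in table:     # UPDATE ... IF FOUND THEN RETURN
            table[(ctype, cid)] += i
            return "updated"
        try:                          # BEGIN ... INSERT
            if (ctype, cid) in table:               # models losing the
                raise KeyError("unique_violation")  # race to another tx
            table[(ctype, cid)] = i
            return "inserted"
        except KeyError:              # EXCEPTION WHEN OTHERS THEN (ignore)
            continue                  # retry the UPDATE from the top

r1 = increment_count(0, 1, 1)
r2 = increment_count(0, 1, 1)
print(table)  # {(0, 1): 2} -- one row per counter, never a duplicate
```

The key point is that the constraint converts "insert a duplicate" into "raise, then loop back to the UPDATE", so the table can never hold two physical rows for one logical counter.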

Any insight would be very helpful. Thanks!

+7
sql database concurrency relational-database postgresql
3 answers

With the primary key in place, the number of rows in this table is always <= the number of threads, and the primary key guarantees that no row is duplicated.

When you remove the primary key, some threads fall behind and the row count grows, while at the same time rows get duplicated. Once rows are duplicated, each update takes longer, and two or more threads will try to update the same rows.
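The resulting deadlock can be sketched as a wait-for cycle (illustrative Python with made-up row and transaction names, not PostgreSQL internals): once the same logical counter exists as two physical rows, two UPDATEs that must lock every matching row can acquire those locks in opposite orders.

```python
# Illustrative wait-for graph: two duplicate physical rows for one
# counter, and two transactions whose scans return them in opposite
# orders, so each grabs one row lock and blocks on the other.
tx1_order = ["row_v1", "row_v2"]  # scan order seen by tx1
tx2_order = ["row_v2", "row_v1"]  # ...and the reverse order for tx2

holds = {"tx1": tx1_order[0], "tx2": tx2_order[0]}  # first row locked
wants = {"tx1": tx1_order[1], "tx2": tx2_order[1]}  # blocked on second

# Edge txA -> txB means txA waits for a row lock that txB holds.
waits_for = {a: b for a in holds for b in holds
             if a != b and wants[a] == holds[b]}
print(waits_for)  # {'tx1': 'tx2', 'tx2': 'tx1'} -- a cycle, i.e. deadlock
```

With the primary key there is only one physical row per counter, so there is nothing for the two transactions to acquire in conflicting orders.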

Open a new terminal and enter:

 watch --interval 1 "psql -tc \"select count(*) from counters\" test" 

Try this with and without the primary key. When you get the first deadlock, look at the output of the query above. In my case, this is what was left in the counters table:

    test=# select * from counters order by 2;
     count_type | count_id | count
    ------------+----------+-------
              0 |        1 |   735
              0 |        1 |   733
              0 |        1 |   735
              0 |        1 |   735
              0 |        2 |   916
              0 |        2 |   914
              0 |        2 |   914
              0 |        3 |   882
              0 |        4 |   999
              0 |        5 |   691
              0 |        5 |   692
    (11 rows)
+1

Your code is a perfect recipe for race conditions (multiple processes, random sleeps). The problem is most likely a locking issue. Since you don't mention the locking mode, I am going to assume page-level locking, which gives the following scenario:

  • Thread 1 begins inserting records; say it locks page 1 and then tries to lock page 2.
  • Thread 2 starts at the same time, but it locks page 2 first and then tries to lock page 1.
  • Both threads are now waiting for each other to finish, so you have a deadlock.

Now, why does PK fix it?

Since locking goes through the index first, the race condition is mitigated: the PK is unique, so conflicting inserts wait on the index entry, and updates are accessed through the index as well, so each record is locked based on its PK.

+1

At some point, one user is waiting for a lock held by another user, while itself holding a lock that the second user wants. That is what causes the deadlock.

My guess: without a primary key (or indeed any key), the UPDATE counters statement in your increment_count procedure has to scan the entire table. The same goes for the primary_relation table. That leaves locks held all over the place and opens the way to a deadlock. I am not a Postgres user, so I don't know the details of when it takes locks, but I'm fairly sure this is what happens.

Adding a PK on counters lets the database pinpoint exactly the rows it reads and take the minimum number of locks. You really should have a PK on primary_relation too!

0
