Django multiprocessing and database connections

Background:

I am working on a project that uses Django with a Postgres database. We also use mod_wsgi, in case that matters, since some of my web searches have mentioned it. On web form submit, the Django view kicks off a job that will take a substantial amount of time (more than the user would want to wait), so we start the job via a system call in the background. The job that is now running needs to be able to read and write to the database. Because this job takes so long, we use multiprocessing to run parts of it in parallel.

Problem:

The top level script has a database connection, and when it spawns child processes, it seems that the parent's connection is available to the children. Then an exception is thrown about how SET TRANSACTION ISOLATION LEVEL must be called before any query. Research indicates that this is due to trying to use the same database connection in multiple processes. One thread I found suggested calling connection.close() at the start of the child processes so that Django would automatically create a new connection when it needs one, and therefore each child process would have a unique connection, i.e. not shared. This did not work for me, since calling connection.close() in the child process made the parent process complain that the connection was lost.
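To illustrate, the structure is roughly like this (a simplified sketch, not our actual code; the Job model and function names are placeholders):

    from multiprocessing import Process

    from myapp.models import Job  # hypothetical app and model

    def child_work(job_id):
        # The child inherits the parent's open connection via fork and
        # reuses it here, which is what triggers the isolation-level error.
        job = Job.objects.get(pk=job_id)
        job.status = 'done'
        job.save()

    def run_parallel():
        # The parent touches the database first, so a connection is open before forking.
        job_ids = list(Job.objects.values_list('id', flat=True))
        procs = [Process(target=child_work, args=(job_id,)) for job_id in job_ids]
        for p in procs:
            p.start()
        for p in procs:
            p.join()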

Other findings:

Some of the things I read seemed to indicate that you cannot really do this, and that multiprocessing, mod_wsgi and Django do not mix well. That seems a little hard for me to believe, I guess.

Some suggested using celery, which might be a long-term solution, but I cannot get celery installed at this time, pending some approval processes, so it is not an option right now.

Found several references on SO and elsewhere about persistent database connections, which I believe to be a different problem.

Also found references to psycopg2.pool and pgpool and something about a bouncer. Admittedly, I did not understand most of what I read on those, but it certainly did not jump out at me as being what I was looking for.

Current "Work-Around":

For now, I have reverted to simply running things sequentially, and it works, but it is slower than I would like.

Any suggestions as to how I can use multiprocessing to run in parallel? It seems that if the parent and the two children could each have independent connections to the database, everything would be fine, but I cannot seem to get that behavior.

Thank you and sorry for the length!

+51
django multiprocessing
Nov 23 '11 at 13:18
7 answers

Multiprocessing copies connection objects between processes because it forks processes and therefore copies all the file descriptors of the parent process. A connection to the SQL server is just a file; you can see it on Linux under /proc//fd/.... Any open file is shared between forked processes. You can find more about forking here.
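A minimal sketch to see the sharing for yourself (assuming an already-configured Django process, e.g. a management command or shell); the forked child reports the same socket descriptor number as the parent:

    import os
    from multiprocessing import Process

    from django.db import connection

    def show_fd():
        # Print this process's pid and the file descriptor of the Postgres socket.
        print(os.getpid(), connection.connection.fileno())

    connection.ensure_connection()  # make the parent open its connection first
    show_fd()                       # parent's pid and fd
    p = Process(target=show_fd)     # the forked child reports the same fd number
    p.start()
    p.join()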

My solution was simply to close the db connections just before launching the processes; each process then recreates its own connection when it needs one (tested in Django 1.4):

    from multiprocessing import Process

    from django import db

    def db_worker():
        some_parallel_code()  # the actual parallel work goes here

    db.connections.close_all()  # close before forking so each child opens its own connection
    Process(target=db_worker, args=()).start()

Pgbouncer/pgpool is not related to threads in the multiprocessing sense. It is rather a solution for not closing the connection on each request, i.e. speeding up connecting to Postgres under high load.

Update:

To completely remove database connection issues, just move all the database logic into db_worker. I originally wanted to pass a QuerySet as an argument... A better idea is simply to pass a list of ids: see values_list('id', flat=True), and do not forget to cast it to a list, i.e. list(qs), before passing it to db_worker. Thanks to that, we do not copy the model's database connection.

    def db_worker(model_ids):
        # here you do Model.objects.filter(id__in=model_ids)
        obj = PartModelWorkerClass(model_ids)
        obj.run()

    model_ids = Model.objects.all().values_list('id', flat=True)
    model_ids = list(model_ids)  # cast to list so no QuerySet (and its connection) is passed
    process_count = 5
    delta = (len(model_ids) // process_count) + 1

    # do all the db stuff here ...
    # then close the db connections before forking
    from django import db
    db.connections.close_all()

    for it in range(process_count):
        Process(target=db_worker, args=(model_ids[it * delta:(it + 1) * delta],)).start()
+40
May 21 '12 at 11:47

When using multiple databases, you must close all connections.

    from django import db

    for connection_name in db.connections.databases:
        db.connections[connection_name].close()

EDIT

Please use db.connections.close_all(), as @lechup mentioned, to close all connections (I am not sure from which Django version this method was added):

    from django import db

    db.connections.close_all()
+9
Jan 26 '14 at 22:02

For Python 3 and Django 1.9, this is what worked for me:

    import multiprocessing

    import django
    django.setup()  # Must call setup

    def db_worker():
        for name in django.db.connections.databases:
            django.db.connections[name].close()  # Close the DB connections
        # Execute parallel code here

    if __name__ == '__main__':
        multiprocessing.Process(target=db_worker).start()

Note that without django.setup() I could not get this to work. I am guessing something needs to be initialized again for multiprocessing.

+3
Jul 13 '16 at 15:57

I ran into this problem and was able to resolve it by doing the following (we are implementing a limited task system):

task.py

    from django.db import connection

    def as_task(fn):
        """
        This is a decorator that handles task duties, like setting up loggers,
        reporting on status... etc.
        """
        connection.close()  # this is where I kill the database connection - VERY IMPORTANT
        # This will force django to open a new unique connection, since on linux at least
        # connections do not fare well when forked
        # ...etc

ScheduledJob.py

    import multiprocessing

    from django.core.management import call_command
    from django.db import connection

    def run_task(request, job_id):
        """ Just a simple view that, when hit with a specific job id, kicks off said job """
        # your logic goes here
        # ...
        processor = multiprocessing.Queue()
        kwargs = {'web_processor': processor}
        kwargs.update(vars(options))
        multiprocessing.Process(
            target=call_command,  # all of our tasks are set up as management commands in django
            args=[job_info.management_command],
            kwargs=kwargs,
        ).start()

        result = processor.get(timeout=10)  # wait to get a response on a successful init
        # Result is a tuple of [TRUE|FALSE, <ErrorMessage>]
        if not result[0]:
            raise Exception(result[1])
        else:
            # THE VERY IMPORTANT PART: note that up to this point we have not touched the db again,
            # but now we absolutely have to call connection.close()
            connection.close()
            # we do some database accessing here to get the most recently updated job id in the database

Honestly, to prevent race conditions (with multiple simultaneous users) it would be best to call connection.close() as quickly as possible after you fork the process. There may still be a chance that another user somewhere down the line makes a request to the db before you have a chance to flush the connection, though.

Honestly, it would probably be safer and smarter to have your fork not invoke the command directly, but instead call a script at the operating-system level, so that the spawned task runs in its own Django shell!
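A minimal sketch of that idea, assuming a hypothetical management command named run_my_job and a standard manage.py layout (not part of the original answer):

    import subprocess
    import sys

    def launch_in_own_shell(job_id):
        # Launch the job as a separate OS process running its own manage.py command,
        # so it gets a completely fresh Django setup and database connection.
        subprocess.Popen(
            [sys.executable, 'manage.py', 'run_my_job', str(job_id)],  # run_my_job is hypothetical
            close_fds=True,  # do not leak the parent's file descriptors (including db sockets)
        )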

+2
Oct. 31 '13 at 17:26

(not a great solution, but a possible workaround)

If you cannot use celery, perhaps you could implement your own queueing system, basically adding tasks to a task table and having a regular cron job that picks them up and processes them (via a management command)? A rough sketch of that idea follows.
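A rough sketch of what such a setup could look like (the Task model and the process_tasks command are illustrative placeholders, not an existing API):

    # models.py - a minimal task table (illustrative)
    from django.db import models

    class Task(models.Model):
        name = models.CharField(max_length=100)
        payload = models.TextField(blank=True)
        done = models.BooleanField(default=False)
        created = models.DateTimeField(auto_now_add=True)

    # management/commands/process_tasks.py - run from cron, e.g. every minute
    from django.core.management.base import BaseCommand

    from myapp.models import Task  # hypothetical app

    class Command(BaseCommand):
        help = "Pick up pending tasks and process them"

        def handle(self, *args, **options):
            for task in Task.objects.filter(done=False):
                # ... do the actual work for this task here ...
                task.done = True
                task.save()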

+1
Nov 23 '11 at 13:30

If all you need is I/O parallelism and not processing parallelism, you can avoid this problem by switching your processes to threads. Replace

 from multiprocessing import Process 

with

 from threading import Thread 

The Thread object has the same interface as Process.
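A minimal sketch of the swap, with placeholder names for the work function and the id chunks:

    from threading import Thread

    def db_worker(model_ids):
        # Django keeps connections in thread-local storage, so each thread
        # transparently opens its own database connection on first use.
        ...  # I/O-bound work, e.g. using Model.objects.filter(id__in=model_ids)

    id_chunks = [[1, 2, 3], [4, 5, 6]]  # however you split up the work
    threads = [Thread(target=db_worker, args=(chunk,)) for chunk in id_chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()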

+1
Nov 06 '17 at 18:06

You can give Postgres more resources; on Debian/Ubuntu you can edit:

 nano /etc/postgresql/9.4/main/postgresql.conf 

replacing 9.4 with your Postgres version.

Here are some useful lines that should be updated with example values; the parameter names are self-explanatory:

    max_connections = 100
    shared_buffers = 3000MB
    temp_buffers = 800MB
    effective_io_concurrency = 300
    max_worker_processes = 80

Be careful not to raise these parameters too much, as it might lead to errors with Postgres trying to take more resources than are available. The examples above run fine on a Debian machine with 8 GB of RAM and four cores.

0
Jun 29 '15 at 19:55


