How to set up an asynchronous, long-running background data processing task?

Newbie question about Django app design:

I am building a reporting engine for my website. I have a large (and growing) amount of data and an algorithm that should be applied to it. The calculations promise to be resource-heavy, so it would be foolish to perform them at request time. Instead, I am thinking of putting them in a background process that runs continuously and periodically stores results that a Django view can load on demand to produce the HTML output.

And my question is: what is the right approach to building such a system? Any thoughts?

+4
3 answers

Celery is one of your best options; we are using it successfully. It has a powerful scheduling mechanism: you can schedule tasks to run periodically, or kick off background tasks when a user request (for example) triggers them.
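A rough sketch of both patterns is below; the app name reports, the task generate_report and the 15-minute schedule are placeholders, and the exact setting name depends on your Celery version:

    # reports/tasks.py -- hypothetical task doing the heavy calculations
    from celery import shared_task

    @shared_task
    def generate_report():
        # crunch the numbers and store the result in the database,
        # so a plain Django view can simply read it later
        ...

    # settings.py -- run it periodically with celery beat (Celery 3.x naming;
    # on Celery 4+ this would be app.conf.beat_schedule instead)
    from datetime import timedelta

    CELERYBEAT_SCHEDULE = {
        'regenerate-reports': {
            'task': 'reports.tasks.generate_report',
            'schedule': timedelta(minutes=15),
        },
    }

    # ...or fire it off on demand, e.g. from a view:
    # generate_report.delay()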

It also provides ways to query the status of such background tasks and has a number of flow-control features. It makes it very easy to distribute work: your Celery tasks can run on a separate machine, which is very useful on a split web/worker setup such as Heroku, where the web process is limited to a maximum of 30 seconds per request. It supports various queue backends (a database, RabbitMQ, or a number of other mechanisms), and with the simplest setup it can use the same database your Django site already uses, which simplifies configuration.
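The "reuse the Django database as the broker" setup is just a couple of settings. This sketch assumes the old kombu Django transport, which newer Celery versions have dropped in favour of a real broker such as RabbitMQ or Redis:

    # settings.py -- use the existing Django database as the Celery broker
    INSTALLED_APPS += ('kombu.transport.django',)
    BROKER_URL = 'django://'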

And if you use automated tests, it also has a feature that helps with testing: it can be put into "eager" mode, where tasks are run synchronously instead of in the background, which makes tests predictable and deterministic.
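That toggle is just a test-settings flag (Celery 3.x names shown; on Celery 4+ the equivalents are task_always_eager / task_eager_propagates):

    # test settings -- run tasks synchronously so tests see results immediately
    CELERY_ALWAYS_EAGER = True
    CELERY_EAGER_PROPAGATES_EXCEPTIONS = True  # let task errors fail the test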

Further information here: http://docs.celeryproject.org:8000/en/latest/django/

+2

Do you mean that the results are written back to the database, or do you want to generate the Django views directly from your independently running code?

If you have large amounts of data, I like to use Python's multiprocessing module. You can create a generator that fills a JoinableQueue with tasks and a pool of workers that consume them. That way you should be able to make full use of the resources in your system.

The multiprocessing module also lets you distribute tasks over the network (for example with multiprocessing.Manager()). With that in mind, you should be able to scale things out easily if you ever need a second machine to keep up with the data.

Example:

This example shows how to create multiple processes. The generator function would query the database for all new entries that require the heavy lifting; the consumers take individual items from the queue and perform the actual calculations.

    import time
    from multiprocessing import JoinableQueue, Process

    QUEUE = JoinableQueue(-1)  # maxsize <= 0 means no size limit

    def generator():
        """Put items in the queue. For example, query the database for all new,
        unprocessed entries that need some serious math done."""
        while True:
            QUEUE.put("Item")
            time.sleep(0.1)

    def consumer(consumer_id):
        """Consume items from the queue. Do your calculations here."""
        while True:
            item = QUEUE.get()
            print("Process %s has done: %s" % (consumer_id, item))
            QUEUE.task_done()

    if __name__ == "__main__":
        p = Process(target=generator)
        p.start()

        workers = []
        for x in range(2):
            w = Process(target=consumer, args=(x,))
            w.start()
            workers.append(w)

        # both loops run forever in this toy example, so these joins block
        # until the processes are killed; a real setup would add a stop signal
        p.join()
        for w in workers:
            w.join()
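And if you do outgrow one machine, a rough sketch of sharing a queue over the network, following the Manager pattern from the standard library docs, could look like this (host, port and authkey are placeholders):

    # server side -- expose a queue that workers on other machines can consume
    import queue
    from multiprocessing.managers import BaseManager

    work_queue = queue.Queue()

    class QueueManager(BaseManager):
        pass

    QueueManager.register('get_queue', callable=lambda: work_queue)
    manager = QueueManager(address=('', 50000), authkey=b'secret')
    server = manager.get_server()
    server.serve_forever()

    # worker side, on another machine:
    # QueueManager.register('get_queue')
    # manager = QueueManager(address=('server-host', 50000), authkey=b'secret')
    # manager.connect()
    # work_queue = manager.get_queue()   # proxy to the server's queue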
0

Why not have a URL or Python script that runs whatever calculation you need each time it is invoked, and then fetch that URL or run the script via a cron job on the server? From your question it doesn't sound like you need much more than that.
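A minimal sketch of that approach, assuming you wrap the calculation in a custom management command (the command name build_reports and the paths are made up):

    # myapp/management/commands/build_reports.py
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = "Recalculate the report data and store the results."

        def handle(self, *args, **options):
            # do the heavy calculations and save the results to the database
            ...

    # crontab entry: rebuild the reports every hour
    # 0 * * * * cd /path/to/project && python manage.py build_reports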

0
