What you're looking for is a combination of offline processing and caching. By offline, I mean that the calculation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation stays valid for some amount of time, during which you don't need to recompute it for display. This is a very common pattern.
Offline processing
There are two widely used approaches to work that needs to happen outside the request-response cycle:
- Cron jobs (often made easier with a custom management command)
- Celery
In relative terms, cron is simpler to set up and Celery is more powerful/flexible. That said, Celery has fantastic documentation and a comprehensive test suite. I've used it in production on nearly every project, and while it does bring some requirements with it, it isn't a bear to set up.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store a result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your Django project, i.e. your code must correctly load your settings.py so it knows about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
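For the curious, the boilerplate a standalone script run straight from cron tends to need before it can touch your models looks roughly like this; the project path, settings module, and model are made up, and django.setup() is only needed on newer Django versions:

```python
#!/usr/bin/env python
# standalone script invoked directly from the crontab
import os
import sys

# make the project importable and tell Django which settings to use
sys.path.insert(0, "/path/to/your/project")
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

import django
django.setup()  # needed on newer Django versions before models can be imported

from myapp.models import DailyAverage  # made-up app/model

if __name__ == "__main__":
    # ... run your logic and store the result in the database ...
    print(DailyAverage.objects.count())
```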
If you go the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests for it), and you won't need to do any special hoopla at the top of the command to set up a proper Django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure whether this approach works without the assistance of virtualenv, which you really ought to be using regardless.
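A minimal sketch of such a command, assuming an app called myapp and a command name of recalculate_averages (both made up here):

```python
# myapp/management/commands/recalculate_averages.py
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Recalculate the expensive averages and store the results."

    def handle(self, *args, **options):
        # ... your calculation logic goes here ...
        self.stdout.write("Averages recalculated.")
```

The crontab entry then just invokes manage.py with your virtualenv's interpreter, something like: 0 * * * * /path/to/virtualenv/bin/python /path/to/manage.py recalculate_averages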
Another aspect to consider with cron jobs: if your logic takes a variable amount of time to run, cron is oblivious to the matter. A cute way to kill your server is to run a two-hour job every hour. You can roll your own locking mechanism to prevent this; just be aware that what starts out as a short cron job might not stay that way as your data grows, or when your RDBMS misbehaves, and so on.
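If you do roll your own lock, one low-tech sketch uses Django's cache.add(), which only succeeds if the key doesn't already exist. This assumes a shared cache backend like memcached; the key name and timeout are arbitrary:

```python
from django.core.cache import cache

LOCK_KEY = "recalculate-averages-lock"  # made-up key name
LOCK_TTL = 60 * 60 * 3                  # let the lock expire eventually in case of a crash


def run_job_with_lock():
    # cache.add() returns False if the key already exists, so a second
    # invocation that overlaps the first simply bails out.
    if not cache.add(LOCK_KEY, "locked", LOCK_TTL):
        return

    try:
        pass  # ... the long-running work goes here ...
    finally:
        cache.delete(LOCK_KEY)
```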
In your case, it sounds like cron is less applicable, because you would need to calculate the graphs for every user every so often, regardless of who is actually using the system. This is where Celery can help.
Celery
...is the bee's knees. People are usually scared off by its "default" requirement of an AMQP broker. Setting up RabbitMQ isn't terribly onerous, but it does take you a little outside the comfortable Python world. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:

```python
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
```

Voilà, no need for an AMQP broker.
Celery provides a wealth of advantages over plain cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up a request/response cycle.
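For instance, with the old-style decorators that match the settings above (task names are made up; newer Celery spells this with app.task and a beat_schedule instead):

```python
from datetime import timedelta

from celery.task import task, periodic_task


@task
def rebuild_report(user_id):
    # fired on demand, e.g. from a view: rebuild_report.delay(request.user.pk)
    ...


@periodic_task(run_every=timedelta(hours=1))
def hourly_rebuild():
    # scheduled by celerybeat, much like a cron entry
    ...
```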
If you don't want to compute the chart for every active user every so often, you will need to generate it on demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here is an example flow:
- The user requests a page containing an averages chart.
- Check the cache: is there a stored, non-expired queryset of averages for this user?
  - If yes, use it.
  - If not, fire off a Celery task to recalculate it, then re-query and cache the result. Since querying the existing data is cheap, run that query anyway if you want to show stale data to the user in the meantime.
- If the chart is stale, optionally show some indication that it is stale, or do some ajax fanciness to ping Django every so often and ask whether the refreshed chart is ready.
You could combine this with a periodic task that recalculates the chart every hour for users with an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it gives you all the control you need to ensure freshness while throttling the CPU load of the calculation task. Best of all, the periodic task and the on-demand task share the same logic: you define the task once and call it from both places for added DRYness.
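One way that combination might look, assuming the on-demand task is the calculate_stuff task from the code below and that "active session" means an unexpired row in Django's session table (adjust to however you actually track activity):

```python
from datetime import timedelta

from celery.task import periodic_task
from django.contrib.sessions.models import Session
from django.utils import timezone

from mytasks import calculate_stuff  # the same task the views fire on demand


@periodic_task(run_every=timedelta(hours=1))
def refresh_active_user_charts():
    # collect the ids of users with an unexpired session
    user_ids = set()
    for session in Session.objects.filter(expire_date__gte=timezone.now()):
        user_id = session.get_decoded().get("_auth_user_id")
        if user_id:
            user_ids.add(user_id)

    # queue one recalculation per active user
    for user_id in user_ids:
        calculate_stuff.delay(user_id)
```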
Caching
The Django cache framework gives you all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend; I've recently started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it on a major production site yet.
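For reference, a typical memcached configuration in settings.py looks something like this (host and port are just the usual defaults):

```python
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    }
}
```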
Here's some code showing the use of the low-level cache API to accomplish the workflow described above:

```python
import pickle

from django.core.cache import cache
from django.shortcuts import render

from mytasks import calculate_stuff

from celery.task import task


@task
def calculate_stuff(user_id):
    # the expensive recalculation and caching happens here; see the fuller sketch below
    ...
```
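A fuller sketch of how the task body and the view might fit together; TransactionAverage, recompute_averages_for, the cache key, and the template name are all stand-ins for your own models and logic:

```python
import pickle

from django.core.cache import cache
from django.shortcuts import render
from celery.task import task

from myapp.models import TransactionAverage  # hypothetical model of precomputed averages

CACHE_KEY = "averages_%s"   # one cache entry per user
CACHE_TTL = 60 * 15         # how long a computed chart counts as fresh (seconds)


@task
def calculate_stuff(user_id):
    # the expensive part: recompute and store the averages for this user
    recompute_averages_for(user_id)  # placeholder for your own aggregation logic

    # re-query the (cheap) stored averages and cache the pickled queryset
    averages = TransactionAverage.objects.filter(user_id=user_id)
    cache.set(CACHE_KEY % user_id, pickle.dumps(averages), CACHE_TTL)


def averages_chart(request):
    cached = cache.get(CACHE_KEY % request.user.pk)
    if cached is not None:
        # fresh-enough data is already cached; no work to do
        averages = pickle.loads(cached)
        stale = False
    else:
        # kick off the recalculation outside the request/response cycle...
        calculate_stuff.delay(request.user.pk)
        # ...and show whatever stale data already exists in the meantime
        averages = TransactionAverage.objects.filter(user_id=request.user.pk)
        stale = True

    return render(request, "averages.html", {"averages": averages, "stale": stale})
```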
Edit: It's worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a large amount of data with your averages queryset, this could be suboptimal. Testing with real-world data would be wise in any case.