Django design for web analytics screens that take a very long time to calculate

I have a dashboard screen, visible to my Django web application's users, that takes a long time to calculate. It's one of those screens that goes through every transaction in the database for the user and gives them metrics on them.

I would love for this to be a real-time operation, but the calculation time can be 20-30 seconds for an active user (paging doesn't help, since it reports averages across transactions).

The solution that comes to mind is to calculate all of this offline via a batch manage.py command and then just display the cached values to the user. Is there a Django design pattern to help facilitate these kinds of models/views?

+8
python django batch-file
4 answers

What you are looking for is a combination of offline processing and caching. By offline, I mean that the calculation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation stays valid for some amount of time, during which you don't need to recalculate it for display. This is a very common pattern.

Offline processing

There are two common approaches to doing work outside the request-response cycle:

  • Cron jobs (often simplified with a custom management command)
  • Celery

Relatively speaking, cron is easier to set up and Celery is more powerful/flexible. That said, Celery has fantastic documentation and a comprehensive test suite. I have used it in production on almost every project, and while it does involve some extra requirements, it isn't a bear to set up.

Cron

Cron jobs are the time-tested approach. If all you need is to run some logic and store a result in the database, a cron job has zero dependencies. The only fiddly bits with cron jobs are running your code in the context of your Django project, i.e. your code has to load your settings.py correctly so it knows about your database and apps. For the uninitiated, this can lead to some aggravation in figuring out the proper PYTHONPATH and such.

If you go the cron route, a good approach is to write a custom management command. That makes it easy to test your command from the terminal (and to write tests for it), and you won't need to do any special hoopla at the top of your command to set up a proper Django environment. In production, you simply run path/to/manage.py yourcommand . I'm not sure if this approach works without the assistance of virtualenv , which you really should be using regardless.
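
As a rough illustration only (the command name, module layout, and calculate_and_cache_averages helper below are hypothetical placeholders, not something from the question), such a management command might look like:

    # myapp/management/commands/recalculate_averages.py
    from django.contrib.auth.models import User
    from django.core.management.base import BaseCommand

    # Hypothetical helper holding your expensive aggregation logic.
    from myapp.analytics import calculate_and_cache_averages


    class Command(BaseCommand):
        help = "Recalculate dashboard averages for every user."

        def handle(self, *args, **options):
            for user in User.objects.all():
                calculate_and_cache_averages(user)
                self.stdout.write("Updated averages for %s" % user.username)

Cron then just invokes it, e.g. 0 * * * * /path/to/virtualenv/bin/python /path/to/project/manage.py recalculate_averages.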

Another aspect to consider with cron jobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A nice way to kill your server is to run a two-hour job like this every hour. You can roll your own locking mechanism to prevent this, just be aware of it - what starts out as a short cron job might not stay that way when your data grows, or when your RDBMS misbehaves, etc. etc.
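
One minimal way to roll such a lock, assuming a Unix host (the lock file path and the job wrapper are illustrative, not part of any library's API):

    import fcntl
    import sys

    LOCKFILE = "/tmp/recalculate_averages.lock"  # hypothetical path


    def run_exclusively(job):
        """Run job() only if no other instance currently holds the lock."""
        with open(LOCKFILE, "w") as handle:
            try:
                # Non-blocking exclusive lock; raises IOError if another
                # process already holds it.
                fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except IOError:
                sys.exit("Previous run still in progress; skipping this one.")
            job()

A cache-based lock, or wrapping the crontab entry with flock(1), accomplishes the same thing.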

In your case, it sounds like cron is less applicable, because you would need to calculate the graphs for every user, every time, regardless of who is actually using the system. This is where Celery can help.

Celery

... Celery is the bee's knees. People are often scared off by its default requirement of an AMQP broker. Setting up RabbitMQ isn't terribly burdensome, but it does take you a little bit outside the comfortable Python world. For many tasks, I just use Redis as my task store for Celery. The settings are simple:

    CELERY_RESULT_BACKEND = "redis"
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379
    REDIS_DB = 0
    REDIS_CONNECT_RETRY = True

Voilà, no need for an AMQP broker.

Celery provides many advantages over plain cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up a request/response cycle.

If you don't want to calculate the chart for every active user so often, you will need to generate it on demand. I assume that querying for the latest available averages is cheap, calculating new averages is expensive, and you are generating the actual charts client-side using something like flot . Here is an example flow:

  • The user requests a page containing a graph of averages.
  • Check the cache - is there a stored, calculated queryset of averages for this user?
    • If yes, use that.
    • If not, fire off a celery task to recalculate them, re-query, and cache the result. Since querying the existing data is cheap, run that query anyway if you want to show stale data to the user in the meantime.
  • If the chart is stale, optionally add some indicator that it is out of date, or do some ajax fanciness to ping django every so often and ask whether the updated chart is ready.

You could combine this with a periodic task that recalculates the chart every hour for users who have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it gives you all the control you need to ensure freshness while throttling the CPU load of the calculation task. Best of all, the periodic task and the on-demand task share the same logic - you define the task once and call it from both places for added DRYness.
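
For example, with the older (Celery 2.x-era) periodic task API that matches the rest of this answer - newer Celery versions use the beat_schedule setting instead - the hourly refresh could reuse the exact same task. The get_active_user_ids helper is a hypothetical stand-in for however you decide which users count as active:

    from celery.schedules import crontab
    from celery.task import periodic_task

    # Hypothetical helper, plus the on-demand task shown in the caching
    # example further down.
    from myapp.analytics import get_active_user_ids
    from mytasks import calculate_stuff


    @periodic_task(run_every=crontab(minute=0))  # top of every hour
    def refresh_active_user_charts():
        for user_id in get_active_user_ids():
            # Same task the view fires on demand, so the logic stays DRY.
            calculate_stuff.delay(user_id)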

Caching

Django's cache framework gives you all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend; I've recently started using redis with django-redis-cache instead, but I'm not sure I'd trust it on a major production site yet.
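
For reference, a typical cache configuration (Django 1.3+ style CACHES setting; backend paths vary by Django version and by package, so treat these as an assumption to check against the relevant docs):

    # settings.py - memcached, the usual production default
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
            "LOCATION": "127.0.0.1:11211",
        }
    }

    # ...or, with the django-redis-cache package installed:
    # CACHES = {
    #     "default": {
    #         "BACKEND": "redis_cache.RedisCache",
    #         "LOCATION": "localhost:6379",
    #     }
    # }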

Here's some sample code demonstrating use of the low-level cache API to perform the workflow described above:

    import pickle

    from django.core.cache import cache
    from django.shortcuts import render
    from celery.task import task


    @task
    def calculate_stuff(user_id):
        # ... do your work to update the averages ...

        # now pull the latest series
        averages = TransactionAverage.objects.filter(user=user_id)  # plus your other filters

        # cache the pickled result for ten minutes
        cache.set("averages_%s" % user_id, pickle.dumps(averages), 60 * 10)


    def myview(request, user_id):
        ctx = {}
        cached = cache.get("averages_%s" % user_id, None)
        if cached:
            averages = pickle.loads(cached)  # use the cached queryset
        else:
            # fetch the latest available data for now, same as in the task
            averages = TransactionAverage.objects.filter(user=user_id)  # plus your other filters
            # fire off the celery task to update the information in the background
            calculate_stuff.delay(user_id)  # doesn't happen in-process
            ctx['stale_chart'] = True  # display a warning, if you like

        ctx['averages'] = averages
        # ... do your other work ...
        return render(request, 'my_template.html', ctx)

Edit: it's worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling a large amount of data with your averages queryset, this could be suboptimal. Testing with real-world data would be wise in any case.
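
If memory does become an issue, one hedge is to cache only the plain aggregate values rather than the pickled queryset (the field names below are purely illustrative). Django's cache will pickle the value for you, so the explicit pickle.dumps/pickle.loads above becomes unnecessary:

    # Cache a small list of dicts instead of a full queryset.
    averages = list(
        TransactionAverage.objects.filter(user=user_id)
                                  .values("period", "average_amount")
    )
    cache.set("averages_%s" % user_id, averages, 60 * 10)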

+14

A simple and, IMO, correct solution for such scenarios is to pre-calculate things as they are updated, so that when the user views the dashboard you don't calculate anything, but simply show the already calculated values.

There may be different ways to do this, but the general concept is to trigger the calculation function in the background whenever some value that the calculation depends on changes.

To run this calculation in the background, I usually use celery . So suppose the user adds an item foo in the view view_foo; we call a celery task update_foo_count, which will run in the background and update the foo count. Alternatively, you can set up a celery timer that updates the count every 10 minutes or so, checking whether a recalculation is needed; a recalc flag can be set in the various places where the user updates data.
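
A hedged sketch of that trigger using a post_save signal - the Foo model, the update_foo_count task, and the user_id field are the illustrative names from above, not code from an actual project:

    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from myapp.models import Foo              # hypothetical model
    from myapp.tasks import update_foo_count  # the celery task described above


    @receiver(post_save, sender=Foo)
    def foo_changed(sender, instance, **kwargs):
        # Recalculate in the background whenever a foo is added or edited;
        # the request that saved the foo returns immediately.
        update_foo_count.delay(instance.user_id)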

+3

You should take a look at Django's cache framework .

+1

If the data that is slow to compute can be denormalized and saved when data is added, rather than when it is viewed, then django-denorm might interest you.
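
The underlying pattern, sketched in plain Django rather than django-denorm's own API (the models and fields here are purely illustrative):

    from django.db import models


    class Account(models.Model):
        # Denormalized field: updated whenever a transaction is saved, so the
        # dashboard just reads it instead of aggregating on every page view.
        transaction_average = models.DecimalField(
            max_digits=12, decimal_places=2, default=0
        )


    class Transaction(models.Model):
        account = models.ForeignKey(Account, on_delete=models.CASCADE)
        amount = models.DecimalField(max_digits=12, decimal_places=2)

        def save(self, *args, **kwargs):
            super(Transaction, self).save(*args, **kwargs)
            # Recompute the stored average at write time.
            agg = self.account.transaction_set.aggregate(models.Avg("amount"))
            self.account.transaction_average = agg["amount__avg"] or 0
            self.account.save()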

0
