Python uWSGI configuration

We have a large EC2 instance with 32 cores, currently running Nginx, Tornado, and Redis, and serving an average of 5K requests per second. Everything seems to work fine, but CPU utilization is already at 70% and we need to support even more requests. One idea was to replace Tornado with uWSGI, because we do not really use Tornado's asynchronous features.

Our application consists of a single function: it receives JSON (~4 KB), does some blocking but very fast work (Redis), and returns JSON. The life of a request looks like this (a rough sketch of the handler follows the list):

  • Proxy the HTTP request to one of the Tornado instances (Nginx)
  • Parse the HTTP request (Tornado)
  • Read the POST body (compressed JSON) and convert it to a Python dictionary (Tornado)
  • Fetch data from Redis (with locks), running on the same machine (py-redis with hiredis)
  • Process the data (Python 3.4)
  • Update Redis on the same machine (py-redis with hiredis)
  • Prepare the compressed JSON response (Python 3.4)
  • Send the response back to the proxy (Tornado)
  • Send the response to the client (Nginx)
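
For context, the handler is essentially this shape (a rough sketch, not our real code: the key names, the lock name, and do_work() are placeholders):

    import gzip
    import json

    import redis

    # Rough sketch only: key names, the lock name and do_work() are
    # placeholders, not the real application code.
    r = redis.StrictRedis(unix_socket_path='/tmp/redis.sock')

    def do_work(data, current):
        # placeholder for the "process data" step
        return {'previous': current.decode('utf-8') if current else None}

    def application(environ, start_response):
        # read and decompress the POST body (~4 KB of compressed JSON)
        length = int(environ.get('CONTENT_LENGTH') or 0)
        data = json.loads(gzip.decompress(
            environ['wsgi.input'].read(length)).decode('utf-8'))

        # fast, blocking Redis round trips on the local machine
        with r.lock('lock:' + data['key']):
            current = r.get(data['key'])
            result = do_work(data, current)
            r.set(data['key'], json.dumps(result))

        # compress the JSON response
        body = gzip.compress(json.dumps(result).encode('utf-8'))
        start_response('200 OK', [
            ('Content-Type', 'application/json'),
            ('Content-Encoding', 'gzip'),
            ('Content-Length', str(len(body))),
        ])
        return [body]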

We expected a speed improvement from the uwsgi protocol: we could move Nginx to a separate server and proxy all requests to uWSGI over the uwsgi protocol. But after trying every configuration we could think of and tweaking OS parameters, we still cannot get it to work even at the current load. Most of the time the Nginx log fills with 499 and 502 errors. In some configurations it simply stopped accepting new requests, as if it had hit some OS limit.
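
Roughly the proxy setup we have in mind (a sketch; the upstream address is a placeholder for the app host running uWSGI):

    upstream uwsgi_app {
        server 10.0.0.2:3031;       # placeholder: uWSGI on the app host
    }

    server {
        listen 80;
        location / {
            include uwsgi_params;   # standard uwsgi_param definitions shipped with Nginx
            uwsgi_pass uwsgi_app;   # speak the uwsgi protocol to the backend
        }
    }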

So, as I said, we have 32 cores, 60 GB of free memory, and a very fast network. We do not do heavy work, only very fast blocking operations. What is the best strategy in this case: processes, threads, async? What OS parameters should be set?

Current configuration:

    [uwsgi]
    master = 2
    processes = 100
    socket = /tmp/uwsgi.sock
    wsgi-file = app.py
    daemonize = /dev/null
    pidfile = /tmp/uwsgi.pid
    listen = 64000
    stats = /tmp/stats.socket
    cpu-affinity = 1
    max-fd = 20000
    memory-report = 1
    gevent = 1000
    thunder-lock = 1
    threads = 100
    post-buffering = 1

Nginx config:

    user www-data;
    worker_processes 10;
    pid /run/nginx.pid;

    events {
        worker_connections 1024;
        multi_accept on;
        use epoll;
    }

OS configuration:

    $ sysctl net.core.somaxconn
    net.core.somaxconn = 64000

I know these limits are set too high; I have started trying every possible value.
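
For reference, this is how we change it (a sketch; the value itself is what we keep experimenting with):

    # apply at runtime
    sysctl -w net.core.somaxconn=64000

    # make it persistent, in /etc/sysctl.conf
    net.core.somaxconn = 64000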

UPDATE

I ended up with the following configuration:

    [uwsgi]
    chdir = %d
    master = 1
    processes = %k
    socket = /tmp/%c.sock
    wsgi-file = app.py
    lazy-apps = 1
    touch-chain-reload = %dreload
    virtualenv = %d.env
    daemonize = /dev/null
    pidfile = /tmp/%c.pid
    listen = 40000
    stats = /tmp/stats-%c.socket
    cpu-affinity = 1
    max-fd = 200000
    memory-report = 1
    post-buffering = 1
    threads = 2
1 answer

I think your request is processed like this:

  • HTTP parsing, request routing, JSON parsing
  • running some Python code that issues the Redis request
  • the (blocking) Redis request
  • running the Python code that processes the Redis response
  • JSON serialization, HTTP response serialization

You can measure the processing time on an almost idle system. My guess is that the round trip will come out at 2 or 3 milliseconds. At 70% CPU load this will grow to about 4 or 5 ms (not counting time spent in the Nginx request queue, just the processing inside the uWSGI worker).
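
One way to get that number (a sketch with ApacheBench; the URL and the pre-compressed body file are placeholders for your setup):

    # single client against an otherwise idle box; body.json.gz is a
    # hypothetical pre-compressed request body
    ab -n 1000 -c 1 -p body.json.gz -T 'application/json' http://127.0.0.1/

ab reports the mean time per request directly, which is the round trip figure above.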

At 5k req/s, the average number of requests in flight would then be in the range of 20 ... 25. A decent match for your instance.
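
That estimate is just Little's law (requests in flight = arrival rate × time in system):

    L = λ × W = 5000 req/s × 0.004 ... 0.005 s ≈ 20 ... 25 requests in flight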

The next step is balancing across CPU cores. If you have 32 cores, it makes no sense to allocate 1000 workers; eventually you will choke the system with context switching. A total worker count (nginx + uWSGI + redis) on the order of the available CPU cores gives good balance, perhaps with a few extra to cover blocking I/O (i.e. the filesystem; network requests mostly go to other hosts, such as a DBMS). If blocking I/O becomes a big part of the equation, consider rewriting to asynchronous code and integrating an asynchronous stack.

First observation: you allocate 10 workers to nginx. However, the CPU time nginx spends on a request is much lower than the time uWSGI spends on it. I would start by dedicating about 10% of the system to nginx (3 or 4 worker processes).
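
In the nginx config from the question, that is a one-line change (a sketch):

    worker_processes 4;    # roughly 10% of the 32 cores, per the estimate above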

The rest should be split between uWSGI and redis. I don't know the size of your redis data sets or the complexity of your Python code, but my first attempt would be a 75% / 25% split between uWSGI and redis. That puts redis at about 6 workers and uWSGI at about 20 workers plus the master.

Regarding the threads option in the uwsgi setup: thread switching is cheaper than process switching, but if a large part of your Python code is CPU-bound, it won't fly because of the GIL. The threads option is mostly interesting when a significant portion of your processing time is spent blocked. You can disable threads, or try workers = 10, threads = 2 as a first attempt.
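
As an ini sketch of that starting point (the numbers are the rough split suggested above, to be tuned against measurements):

    [uwsgi]
    ; ten workers, two threads each: the threads cover the short blocking
    ; Redis calls without multiplying the number of processes
    workers = 10
    threads = 2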

