We have a deployed Express web API that receives a decent, but still relatively small, amount of traffic (about 10 requests per second on average). It runs on an Ubuntu EC2 server and is proxied through NGINX. Every so often a request hangs, and if the client waits long enough, the line below is written to the NGINX error log:
upstream timed out (110: Connection timed out) while connecting to upstream
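For reference, as I understand it the timeouts that govern this error are the proxy_* timeout directives below. We don't override any of them anywhere, so the documented 60-second defaults should apply (the values shown are those defaults, written out only for illustration):

    # Timeouts relevant to "upstream timed out ... while connecting to upstream".
    # These are the documented defaults; we do not set them explicitly anywhere.
    location /api {
        proxy_pass http://1.2.3.4:3001;
        proxy_connect_timeout 60s;   # time allowed to establish the TCP connection to the upstream
        proxy_send_timeout    60s;   # time allowed between successive writes to the upstream
        proxy_read_timeout    60s;   # time allowed between successive reads from the upstream
    }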
I have already tried the solution proposed here, but it does not seem to have helped. As far as we can tell this only happens about 1-3 times per minute, but I'm just going off these logs; if the client refreshes the page or navigates away before the request times out, there doesn't appear to be any record of it.
The error message obviously suggests that something is wrong with the upstream server, but why would it happen so rarely? Also, there is no pattern at all in the URLs that trigger the problem, and as far as I know the proxied application remains reachable throughout. Here is the gist of our NGINX configuration:
user www-data;
worker_processes 4;
pid /run/nginx.pid;

events {
    worker_connections 10000;
}

worker_rlimit_nofile 25000;

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    client_max_body_size 15M;

    include /etc/nginx/mime.types;
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;

    default_type application/octet-stream;

    log_format nginx_json '{ "timestamp": "$time_local", '
                          '"request_ip": "$remote_addr", '
                          '"request_user": "$remote_user", '
                          '"request_bytes_sent": "$bytes_sent", '
                          '"response_status": "$status", '
                          '"request": "$request", '
                          '"request_method": "$request_method", '
                          '"http_referrer": "$http_referer", '
                          '"http_user_agent": "$http_user_agent", '
                          '"request_id": "$request_id", '
                          '"server_name": "$server_name", '
                          '"response_time": "$upstream_response_time" }';

    access_log /var/log/nginx/access.log nginx_json;
    error_log /var/log/nginx/error.log;

    gzip on;
    gzip_disable "msie6";

    ssl_prefer_server_ciphers on;
    ssl_session_cache shared:SSL:10m;
    ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:ECDHE-RSA-AES128-GCM-SHA256:AES256+EECDH:DHE-RSA-AES128-GCM-SHA256:AES256+EDH:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4";
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_dhparam /etc/ssl/certs/dhparam.pem;

    resolver 127.0.0.1 valid=30s;

    server {
        listen 80;
        server_name a.mysite.com;
        return 301 https://$server_name$request_uri;
    }

    server {
        listen 443 ssl;
        server_name a.mysite.com;

        add_header Strict-Transport-Security "max-age=31536000";
        add_header Cache-Control no-cache;

        location /api {
            proxy_pass http://1.2.3.4:3001;
            proxy_set_header Host $host;
            proxy_set_header X-Request-Id $request_id;
            proxy_set_header Connection "";
            proxy_http_version 1.1;
        }

        location /ui2 {
            set $uiHost https://abc.cloudfront.net/ui2/index.html?v=1503438694163;
            proxy_pass $uiHost;
        }

        location / {
            set $uiHost https://abc.cloudfront.net/ui/index.html?v=1504012942606;
            proxy_pass $uiHost;
        }

        ssl_certificate /path/to/certificate;
        ssl_certificate_key /path/to/certificate/key;
    }
}
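One thing I'm unsure about: the proxy_http_version 1.1 and empty Connection header in the /api location only give keepalive connections to the backend when paired with an upstream block that has a keepalive pool, and we don't define one. Purely for illustration (the upstream name api_backend is hypothetical, not part of our config), that pairing would look something like:

    # Hypothetical sketch, not in our current config: keepalive to the backend
    # requires an upstream{} block with a connection pool in addition to the
    # HTTP/1.1 + empty Connection header we already set in /api.
    upstream api_backend {
        server 1.2.3.4:3001;
        keepalive 32;    # idle connections kept open to the backend, per worker process
    }

    location /api {
        proxy_pass http://api_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }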
The server blocks at the bottom of the config are repeated for several subdomains, with the /api path usually pointing to the same server on different ports. One subdomain receives most of the traffic. The upstream server (1.2.3.4 in the example) is locked down with EC2 security groups so that it only accepts connections from the NGINX server. Again, the error message suggests there may be something wrong with the upstream server running Express, but nothing in our logs indicates that anything is wrong there.
A few other notes:
- I recently increased worker_connections from 768 to 10,000, which seems to have made the problem a little less frequent. However, we never get anywhere close to that connection limit, and connections are being closed (see the stub_status sketch after this list for one way the live counts can be checked).
- After this increase, every time NGINX is reloaded we see none of these errors for about 10 minutes. This is the main reason I suspect NGINX is the culprit, but I'm no expert.
- While Googling I came across a post suggesting that the proxy_set_header Host $host; directive could cause this. That didn't make much sense to me, but it's something to think about; I haven't tested removing it.
- The API server running the Express application has always appeared to be working fine, and as far as we can tell it is not under heavy load.
- This issue does not occur on the locations proxied to CloudFront.
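For completeness, one straightforward way to sanity-check the live connection counts against worker_connections is a stub_status endpoint (hypothetical location, not in our config; requires ngx_http_stub_status_module):

    # Hypothetical status endpoint for checking active/idle connection counts
    # against worker_connections (requires ngx_http_stub_status_module).
    location = /nginx_status {
        stub_status;
        allow 127.0.0.1;   # only reachable from the box itself
        deny all;
    }

Hitting that endpoint with curl reports active connections plus reading/writing/waiting counts, which is how the "nowhere near the limit" claim can be verified.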
Does anyone see anything obvious, or have ideas for what to explore next? We could really use some help here, as we are pretty lost.
Update: I added some additional variables to the logs, as recommended, and was able to correlate the error with an access log entry. Here are the relevant fields:
{ "offset": 64270628, "response_status": "504", "upstream_header_time": "60.001", "input_type": "log", "source": "/var/log/nginx/access.log", "request_method": "GET", "http_user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko", "@timestamp": "2017-08-30T15:29:15.981Z", "upstream_connect_time": "60.001", "request_user": "-", "response_time": "60.001", "request_bytes_sent": "345", "request_id": "90a41e2224cc4b2c1d3c23d544b9146c", "timestamp": "30/Aug/2017:15:29:15 +0000" }