Multiple Python processes slowing down over time

I have a Python script that goes out and makes a number of HTTP and urllib requests to different domains.

We have a huge number of domains to process and need to get through them as quickly as possible. Since the HTTP requests are slow (e.g. there may be no website at the domain at all), I run a number of copies of the script at any one time, feeding them from a list of domains in a database.

The problem I see is that over a period of time (a few hours to 24 hours) the scripts all start to slow down, and ps -al shows that they are sleeping.

The servers are very powerful (8 cores, 72 GB RAM, 6 TB RAID 6, etc., on an 80 MB 2:1 connection) and are never maxed out, i.e. free -m shows:

    -/+ buffers/cache:      61157      11337
    Swap:                    4510        195       4315

top shows between 80-90% idle

sar -d shows an average of 5.3% util

and, more interestingly, iptraf starts out at around 50-60 MB/s and ends up at 8-10 MB/s after about 4 hours.

I am currently running about 500 copies of the script on each server (2 servers), and both of them show the same problem.

ps -al shows that most of the Python scripts are sleeping, which I don't understand. For example:

    0 S     0 28668  2987  0  80   0 - 71003 sk_wai pts/2    00:00:03 python
    0 S     0 28669  2987  0  80   0 - 71619 inet_s pts/2    00:00:31 python
    0 S     0 28670  2987  0  80   0 - 70947 sk_wai pts/2    00:00:07 python
    0 S     0 28671  2987  0  80   0 - 71609 poll_s pts/2    00:00:29 python
    0 S     0 28672  2987  0  80   0 - 71944 poll_s pts/2    00:00:31 python
    0 S     0 28673  2987  0  80   0 - 71606 poll_s pts/2    00:00:26 python
    0 S     0 28674  2987  0  80   0 - 71425 poll_s pts/2    00:00:20 python
    0 S     0 28675  2987  0  80   0 - 70964 sk_wai pts/2    00:00:01 python
    0 S     0 28676  2987  0  80   0 - 71205 inet_s pts/2    00:00:19 python
    0 S     0 28677  2987  0  80   0 - 71610 inet_s pts/2    00:00:21 python
    0 S     0 28678  2987  0  80   0 - 71491 inet_s pts/2    00:00:22 python

There is no sleep state coded into the script, so I can't understand why ps -al shows most of them sleeping, or why they should get slower and slower, making fewer IP requests over time, when CPU, memory, disk access and bandwidth are all available in abundance.

If anyone could help, I would be very grateful.

EDIT:

The code is massive, as I use a lot of exceptions in it to catch diagnostics about the domain, i.e. the reasons why I can't connect. I will post the code somewhere if needed, but the fundamental calls via httplib and urllib are straight off the Python examples.
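
For illustration only - this is not the actual code - the kind of per-domain call being described is roughly the following, with the name check_domain and the exact exception handling made up for the sketch:

    import httplib
    import socket
    import urllib2

    def check_domain(domain):
        # Illustrative only: fetch the front page and record why it fails.
        url = "http://%s/" % domain
        try:
            response = urllib2.urlopen(url)
            try:
                body = response.read()
            finally:
                response.close()
            return "OK", len(body)
        except urllib2.HTTPError as e:
            return "HTTP error", e.code
        except urllib2.URLError as e:
            return "URL error", str(e.reason)
        except (httplib.HTTPException, socket.error) as e:
            return "connection error", str(e)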

Additional Information:

Both

    quota -u mysql
    quota -u root

come back with nothing.

ulimit -n comes back with 1024. I have changed limits.conf to give mysql 16000 soft and hard limits, and I can run up to 2000 scripts so far, but the problem is still there.

SOME PROGRESS

Okay, so I have changed all the limits for the user and made sure all sockets are closed (there were none left open), and although things are better, I am still getting the slowdown, although not as bad.

Interestingly, I have also noticed some memory leakage - the scripts use more and more memory the longer they run, but I am not sure what is causing it. I store the output in a string and print it to the terminal after each iteration, and I also clear the string at the end, but could all the memory be going to the terminal holding on to the output?

Edit: seemingly not - I ran 30 scripts with no output to the terminal and got the same leaks. I'm not using anything clever (just strings, httplib and urllib) - I wonder if there is a problem with the MySQL Python connector...?
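
For reference, one way to confirm the leak is in the process itself (rather than the terminal) would be to log the script's memory high-water mark after each iteration; a rough sketch, with process_domain standing in for whatever the real loop body does:

    import resource

    def log_memory(iteration):
        # ru_maxrss is the peak resident set size; on Linux it is in kB.
        rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print "iteration %d: peak RSS %.1f MB" % (iteration, rss_kb / 1024.0)

    # e.g. inside the script's main loop (pseudo-names, not the real code):
    # for i, domain in enumerate(domains):
    #     process_domain(domain)
    #     log_memory(i)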

4 answers

Check ulimit and quota for the box and the user(s) running the scripts. /etc/security/limits.conf may also contain resource restrictions you might want to modify.

ulimit -n will show the maximum number of open file descriptors allowed.

  • Is that perhaps being exceeded with all of the open sockets?
  • Does each script close every socket when it is done with it?

You can also check the file descriptors with ls -l /proc/[PID]/fd/, where [PID] is the process ID of one of the scripts.
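
If shell output gets unwieldy with 500 processes, a quick sketch along the same lines (assuming Linux /proc and permission to read the fd directories) that prints the open-descriptor count for every python process:

    import os

    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % pid) as f:
                if "python" not in f.read():
                    continue
            print "PID %s: %d open fds" % (pid, len(os.listdir("/proc/%s/fd" % pid)))
        except (IOError, OSError):
            pass  # process exited, or permission denied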

We will need to see some code to tell what is really going on.


Edit (importing comments, and other troubleshooting ideas):

  • Can you show the code where you open and close the connections?
  • When you run just a few script processes, do they also go idle after a while, or does it only happen when several hundred or more are started at once?
  • Is there a single parent process that starts all of these scripts?

If you use s = urllib2.urlopen(someURL), make sure s.close() actually gets executed. Python can often close things up for you (for example, if you do x = urllib2.urlopen(someURL).read()), but it will leave that to you if you told it to (e.g. by assigning the return value of .urlopen() to a variable). Double check your urllib opens and closes (or all of your I/O code, to be safe). If each script is only supposed to have one open socket at a time, but /proc/PID/fd shows several active/open sockets per script process, then there is definitely a code problem to fix.
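
One way to make the close unconditional (a sketch, not the poster's code) is contextlib.closing, which guarantees the handle gets closed even if read() raises:

    import contextlib
    import urllib2

    some_url = "http://www.example.com/"  # placeholder

    # page.close() runs on the way out of the with-block, exception or not.
    with contextlib.closing(urllib2.urlopen(some_url)) as page:
        data = page.read()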

ulimit -n showing 1024 means that is the open socket/fd limit the mysql user can have. You can change this with ulimit -S -n [LIMIT_#], but check out this article first:
Changing process.max-file-descriptor using 'ulimit -n' can cause MySQL to change the table_open_cache value.

You may need to log out and back in again afterwards. And/or add it to /etc/bashrc (don't forget to source /etc/bashrc if you change bashrc and don't want to log out and back in).

Disk space is another thing I have found (the hard way) can cause very weird problems. I have had processes that looked like they were running (not zombied) but were not doing what was expected, because they held open handles to a log file on a partition with zero disk space left.

netstat -anpTee | grep -i mysql will also show whether these sockets are connected/established/waiting to be closed/waiting on a timeout/etc.

watch -n 0.1 'netstat -anpTee | grep -i mysql' will let you watch the sockets open/close/change state/etc. in real time in a nice table output (you may need to export GREP_OPTIONS= first if it is set to something like --color=always).

lsof -u mysql or lsof -U will also show you the open FDs (the output is quite verbose).


    import socket
    import urllib2

    socket.setdefaulttimeout(15)
    # or setdefaulttimeout(0) for non-blocking:
    # in non-blocking mode (blocking is the default), if a recv() call
    # doesn't find any data, or if a send() call can't immediately
    # dispose of the data, an error exception is raised.

    some_url = "http://www.example.com/"  # whatever URL is being fetched

    s = None
    try:
        s = urllib2.urlopen(some_url)
        # do stuff with s like s.read(), s.headers, etc.
    except (urllib2.HTTPError, urllib2.URLError):
        pass  # e.g. myLogger.exception("Error opening: %s!", some_url)
    finally:
        try:
            s.close()
            # del s - although I don't know if deleting s will help things any
        except:
            pass



Solved! - with great help from Chaun - thank you very much!

The slowdown was because I was not setting a socket timeout, and as such, over a period of time the robots were left hanging, trying to read data that wasn't there. Adding a simple

    import socket

    timeout = 5
    socket.setdefaulttimeout(timeout)

solved it (shame on me - but in my defence I am still learning Python).
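
For what it is worth, on Python 2.6 or newer there is also a per-call alternative to the global default, the timeout argument to urlopen (shown only as a sketch, not what was actually used here):

    import urllib2

    response = urllib2.urlopen("http://www.example.com/", timeout=5)  # seconds
    try:
        data = response.read()
    finally:
        response.close()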

The memory leak comes down to urllib and the version of Python being used. After a lot of searching, it appears to be a problem with nested urlopens - there are plenty of posts about it online once you work out the right question to ask Google.
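
Assuming "nested urlopens" means chained calls whose inner response object is never kept (and therefore never closed), a sketch of the difference:

    import urllib2

    index_url = "http://www.example.com/next"  # hypothetical URL that returns another URL

    # Leak-prone (assumed) pattern: the inner response can never be closed
    # explicitly because no reference to it is kept.
    # data = urllib2.urlopen(urllib2.urlopen(index_url).read().strip()).read()

    # Flattened version: keep a reference to every response and close it.
    first = urllib2.urlopen(index_url)
    try:
        next_url = first.read().strip()
    finally:
        first.close()

    second = urllib2.urlopen(next_url)
    try:
        data = second.read()
    finally:
        second.close()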

Thank you all for your help.

EDIT:

Something that also helped with the memory leak problem (although it did not solve it completely) was doing manual garbage collection:

    import gc
    gc.collect()

Hope this helps someone else.


It is probably some system resource you are being starved of. A guess: could you be hitting the limit of the pool of sockets your system can manage? If so, you may see improved performance if you can get the sockets closed faster, or closed sooner.

EDIT: Depending on how much effort you want to put in, you could restructure your application so that a single process performs multiple requests. One socket can be reused within the same process, and so can lots of other resources. Twisted lends itself very well to this type of programming.
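
A minimal sketch of that idea with Twisted's old-style getPage API (deprecated in newer Twisted releases), assuming a plain list of URLs stands in for the database feed; everything here is illustrative rather than a drop-in replacement:

    from twisted.internet import reactor
    from twisted.internet.defer import DeferredList
    from twisted.web.client import getPage

    urls = ["http://www.example.com/", "http://www.example.org/"]  # hypothetical list

    def on_ok(body, url):
        print "%s: %d bytes" % (url, len(body))

    def on_err(failure, url):
        print "%s: %s" % (url, failure.getErrorMessage())

    # One process, many outstanding requests, no per-domain script needed.
    deferreds = []
    for url in urls:
        d = getPage(url, timeout=15)
        d.addCallback(on_ok, url)
        d.addErrback(on_err, url)
        deferreds.append(d)

    DeferredList(deferreds).addCallback(lambda _: reactor.stop())
    reactor.run()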


Another system resource to keep in mind is the ephemeral port range, /proc/sys/net/ipv4/ip_local_port_range (on Linux). Together with /proc/sys/net/ipv4/tcp_fin_timeout it limits the number of concurrent connections.

From Python WSGI Server Checkpoint:

This basically allows the server to open LOTS of concurrent connections.

 echo "10152 65535″ > /proc/sys/net/ipv4/ip_local_port_range sysctl -w fs.file-max=128000 sysctl -w net.ipv4.tcp_keepalive_time=300 sysctl -w net.core.somaxconn=250000 sysctl -w net.ipv4.tcp_max_syn_backlog=2500 sysctl -w net.core.netdev_max_backlog=2500 ulimit -n 10240 
