Read timeout using either urllib2 or any other http library

I have code to read a url like:

    from urllib2 import Request, urlopen

    req = Request(url)
    for key, val in headers.items():
        req.add_header(key, val)
    res = urlopen(req, timeout=timeout)
    # This line blocks
    content = res.read()

The timeout works for the urlopen() call. But then the code reaches the res.read() call, where I want to read the response data, and the timeout is not applied there. So the read call can hang almost forever, waiting for data from the server. The only solution I have found is to use a signal to interrupt read(), which is not suitable for me since I use threads.

What other options are there? Is there an HTTP library for Python that handles read timeouts? I have looked at httplib2 and requests, and they seem to suffer from the same problem as above. I don't want to write my own non-blocking network code on top of the socket module, because I think there should already be a library for this.

Update: None of the solutions below work for me. You can see for yourself that setting a timeout on the socket or on urlopen has no effect when downloading a large file:

    from urllib2 import urlopen

    url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
    c = urlopen(url)
    c.read()

At least on Windows with Python 2.7.3, timeouts are completely ignored.

+21
python nonblocking timeout sockets
Mar 03 '12 at 18:51
8 answers

I found in my tests (using the method described here) that the timeout set in the urlopen() call also applies to the read() call:

    import urllib2 as u
    c = u.urlopen('http://localhost/', timeout=5.0)
    s = c.read(1<<20)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/socket.py", line 380, in read
        data = self._sock.recv(left)
      File "/usr/lib/python2.7/httplib.py", line 561, in read
        s = self.fp.read(amt)
      File "/usr/lib/python2.7/httplib.py", line 1298, in read
        return s + self._file.read(amt - len(s))
      File "/usr/lib/python2.7/socket.py", line 380, in read
        data = self._sock.recv(left)
    socket.timeout: timed out

Maybe this is a feature of newer versions? I am using Python 2.7 on Ubuntu 12.04 right out of the box.

+5
May 10 '12 at 1:41

One possible (imperfect) solution is to set a global socket timeout, described in more detail here:

    import socket
    import urllib2

    # timeout in seconds
    socket.setdefaulttimeout(10)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)

However, this only works if you are willing to change the timeout globally for every user of the socket module. I am running the request from inside a Celery task, so doing this could clobber the timeouts of the Celery worker code itself.
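If the global default is the only available lever, one partial mitigation is to save the previous value and restore it after the call. This is just a sketch (the function name and the 10-second value are mine), and it is still not thread-safe, since the default is process-wide:

    import socket
    import urllib2

    def fetch_with_temporary_default_timeout(url, timeout=10):
        # Remember whatever default the rest of the process (e.g. the Celery
        # worker) relies on, install our own, and restore the old value after.
        old_timeout = socket.getdefaulttimeout()
        socket.setdefaulttimeout(timeout)
        try:
            return urllib2.urlopen(url).read()
        finally:
            socket.setdefaulttimeout(old_timeout)

Any other thread that opens a socket while this runs still sees the temporary value, so this only narrows the window of the problem rather than eliminating it.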

I would be glad to hear any other solutions ...

+4
Mar 09

It is impossible for any library to do this without using some kind of asynchronous timer, via threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2 and other libraries sets the timeout on the underlying socket. What that actually does is explained in the documentation:

SO_RCVTIMEO

Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.

The condition "blocked for this much time without receiving additional data" is the key part: socket.timeout is only raised if not a single byte arrives for the whole duration of the timeout window. In other words, this is a timeout between received bytes.
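To see the consequence: a server that keeps dripping one byte just before each timeout expires never triggers socket.timeout, so a total limit has to be enforced separately. One hand-rolled way (a rough sketch; the function name, chunk size and limits are mine) is to keep the per-socket timeout for each recv and check a wall-clock deadline between chunk reads:

    import time
    import urllib2

    def read_with_deadline(url, per_read_timeout=5, total_timeout=30, chunk=16 * 1024):
        # per_read_timeout bounds each individual socket recv;
        # the wall-clock deadline bounds the download as a whole.
        response = urllib2.urlopen(url, timeout=per_read_timeout)
        deadline = time.time() + total_timeout
        parts = []
        while True:
            data = response.read(chunk)
            if not data:
                break
            parts.append(data)
            if time.time() > deadline:
                # the check only runs between chunks, so a slow-drip server
                # can still overshoot the deadline by up to one chunk read
                raise IOError("total read deadline exceeded")
        return ''.join(parts)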

A simple function that uses threading.Timer to enforce the limit could look like this:

    import httplib
    import socket
    import threading

    def download(host, path, timeout=10):
        content = None

        http = httplib.HTTPConnection(host)
        http.request('GET', path)
        response = http.getresponse()

        timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
        timer.start()

        try:
            content = response.read()
        except httplib.IncompleteRead:
            pass

        timer.cancel()  # cancel on triggered Timer is safe

        http.close()
        return content

    >>> host = 'releases.ubuntu.com'
    >>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
    >>> print content is None
    True
    >>> content = download(host, '/15.04/MD5SUMS', 1)
    >>> print content is None
    False

Other than checking for None, it is also possible to catch the httplib.IncompleteRead exception outside the function rather than inside it. That latter approach will not work, though, if the HTTP response does not have a Content-Length header.
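For that second variant, the call site might look roughly like this (a sketch that assumes the try/except around response.read() has been removed from download() so the exception propagates; e.partial then holds whatever arrived before the cut-off):

    import httplib

    try:
        content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
    except httplib.IncompleteRead as e:
        # the timer shut the socket down mid-read
        content = e.partial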

+4
Sep 20 '15

I would expect this to be a common problem, and yet no answers are to be found anywhere... I just built a solution for this using a timeout signal:

    import urllib2
    import socket

    timeout = 10
    socket.setdefaulttimeout(timeout)

    import time
    import signal

    def timeout_catcher(signum, _):
        raise urllib2.URLError("Read timeout")

    signal.signal(signal.SIGALRM, timeout_catcher)

    def safe_read(url, timeout_time):
        signal.setitimer(signal.ITIMER_REAL, timeout_time)
        content = urllib2.urlopen(url, timeout=timeout_time).read()
        signal.setitimer(signal.ITIMER_REAL, 0)
        # you should also catch any exceptions coming out of urlopen here,
        # set the timer to 0, and pass the exceptions on.
        return content
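A call site for safe_read might look like this (a sketch; keep in mind that SIGALRM only works in the main thread of the process and is not available on Windows):

    try:
        content = safe_read('http://uberdns.eu', 10.0)
    except urllib2.URLError:
        # raised by timeout_catcher when the alarm fires mid-read
        content = None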

Credit for the signal part of the solution goes here btw: python timer secret

+2
Aug 07 '13 at 18:21

pycurl.TIMEOUT works for the whole request:

    #!/usr/bin/env python3
    """Test that pycurl.TIMEOUT does limit the total request timeout."""
    import sys
    import pycurl

    timeout = 2  # NOTE: it does limit both the total *connection* and *read* timeouts

    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, timeout)
    c.setopt(pycurl.TIMEOUT, timeout)
    c.setopt(pycurl.WRITEFUNCTION, sys.stdout.buffer.write)
    c.setopt(pycurl.HEADERFUNCTION, sys.stderr.buffer.write)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://localhost:8000')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()

The code raises the timeout error in ~2 seconds. I tested the total read timeout against a server that sends the response in several chunks, with less time than the timeout between the chunks:

 $ python -mslow_http_server 1 

where slow_http_server.py :

    #!/usr/bin/env python
    """Usage: python -mslow_http_server [<read_timeout>]

       Return an http response with *read_timeout* seconds between parts.
    """
    import time
    try:
        from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer, test
    except ImportError:  # Python 3
        from http.server import BaseHTTPRequestHandler, HTTPServer, test

    def SlowRequestHandlerFactory(read_timeout):
        class HTTPRequestHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                n = 5
                data = b'1\n'
                self.send_response(200)
                self.send_header("Content-type", "text/plain; charset=utf-8")
                self.send_header("Content-Length", n*len(data))
                self.end_headers()
                for i in range(n):
                    self.wfile.write(data)
                    self.wfile.flush()
                    time.sleep(read_timeout)
        return HTTPRequestHandler

    if __name__ == "__main__":
        import sys
        read_timeout = int(sys.argv[1]) if len(sys.argv) > 1 else 5
        test(HandlerClass=SlowRequestHandlerFactory(read_timeout),
             ServerClass=HTTPServer)

I tested the total connect timeout against http://google.com:22222.

+2
Sep 21 '15 at 0:35

This is not the behavior I see. I get a URLError when the call times out:

    from urllib2 import Request, urlopen

    req = Request('http://www.google.com')
    res = urlopen(req, timeout=0.000001)
    # Traceback (most recent call last):
    #   File "<stdin>", line 1, in <module>
    #   ...
    #   raise URLError(err)
    # urllib2.URLError: <urlopen error timed out>

Can't you catch this error and then avoid trying to read res? When I try to use res.read() after this, I get NameError: name 'res' is not defined. Is something like this what you need:

    try:
        res = urlopen(req, timeout=3.0)
    except:
        print 'Doh!'
    else:
        # only read when urlopen succeeded, otherwise res is undefined
        print 'yay!'
        print res.read()

I guess the way to implement the timeout manually is via multiprocessing, no? If the job has not finished yet, you can terminate it.
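Roughly, that multiprocessing idea could look like this (just a sketch; the function names are mine): run the download in a worker process and abandon it if it exceeds the deadline.

    import multiprocessing
    import urllib2

    def _fetch(url):
        return urllib2.urlopen(url).read()

    def fetch_with_hard_timeout(url, timeout=10):
        pool = multiprocessing.Pool(processes=1)
        try:
            # get(timeout) raises multiprocessing.TimeoutError if the worker
            # is still stuck in read() when the deadline passes
            return pool.apply_async(_fetch, (url,)).get(timeout)
        except multiprocessing.TimeoutError:
            return None
        finally:
            pool.terminate()  # kill the worker even if it never returns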

0
Mar 03 '12 at 19:01

Any asynchronous networking library should allow enforcing a total timeout on any I/O operation; for example, here is a gevent code example:

    #!/usr/bin/env python2
    import gevent
    import gevent.monkey  # $ pip install gevent
    gevent.monkey.patch_all()

    import urllib2

    with gevent.Timeout(2):  # enforce total timeout
        response = urllib2.urlopen('http://localhost:8000')
        encoding = response.headers.getparam('charset')
        print response.read().decode(encoding)

And here is the asyncio equivalent:

    #!/usr/bin/env python3.5
    import asyncio
    import aiohttp  # $ pip install aiohttp

    async def fetch_text(url):
        response = await aiohttp.get(url)
        return await response.text()

    text = asyncio.get_event_loop().run_until_complete(
        asyncio.wait_for(fetch_text('http://localhost:8000'), timeout=2))
    print(text)

A test http server is described here.

0
Sep 21 '15 at 20:17

I had the same problem with a socket timeout on the read statement. What worked for me was putting both the urlopen and the read inside a try statement. Hope this helps!
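In code, that presumably means something along these lines (a sketch of what is described; the URL and timeout value are placeholders):

    import socket
    import urllib2

    socket.setdefaulttimeout(10)

    try:
        res = urllib2.urlopen('http://example.com/big-file.iso')
        content = res.read()  # socket.timeout raised here is caught as well
    except (urllib2.URLError, socket.timeout):
        content = None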

-1
12 Oct '13 at 2:57


