Python Twisted Agent timeout

I'm new to Twisted, and I'm trying to write an asynchronous client that fetches some URLs and saves the result in a separate file for each URL. When I run the program with a limited number of servers, say 10, the reactor loop finishes correctly and the program exits. But when I run it with, for example, the Alexa top 2500, the program starts fetching URLs but then never terminates. I set a timeout, but it does not work. I believe there must be some open socket that never triggers a callback for either error or success. My goal is that once the program has fetched the pages, or the timeout for each connection has expired, the program terminates and closes all open file descriptors.

Sorry, the indentation of the code was lost during copy and paste; I have now checked and fixed it. The code is a minimal example. Please note that my problem is that the reactor does not stop when I run the program with a large number of sites to scan.

    #!/usr/bin/env python
    from pprint import pformat
    from twisted.internet import reactor
    import twisted.internet.defer
    import sys
    from twisted.internet.protocol import Protocol
    from twisted.web.client import Agent
    from twisted.web.http_headers import Headers

    class PrinterClient(Protocol):
        def __init__(self, whenFinished, output):
            self.whenFinished = whenFinished
            self.output = output

        def dataReceived(self, bytes):
            #print '##### Received #####\n%s' % (bytes,)
            self.output.write('%s' % (bytes,))

        def connectionLost(self, reason):
            print 'Finished:', reason.getErrorMessage()
            self.output.write('Finished: %s \n'%(reason.getErrorMessage()))
            self.output.write('#########end########%s\n'%(reason.getErrorMessage()))
            self.whenFinished.callback(None)

    def handleResponse(r, output, url):
        output.write('############start############\n')
        output.write('%s\n'%(url))
        #print "version=%s\ncode=%s\nphrase='%s'" % (r.version, r.code, r.phrase)
        output.write("version=%s\ncode=%s\nphrase='%s'"\
            %(r.version, r.code, r.phrase))
        for k, v in r.headers.getAllRawHeaders():
            #print "%s: %s" % (k, '\n '.join(v))
            output.write("%s: %s\n" % (k, '\n '.join(v)))
        whenFinished = twisted.internet.defer.Deferred()
        r.deliverBody(PrinterClient(whenFinished, output))
        return whenFinished

    def handleError(reason):
        print reason
        #reason.printTraceback()
        #reactor.stop()

    def getPage(url, output):
        print "Requesting %s" % (url,)
        d = Agent(reactor).request('GET', url, Headers({'User-Agent': ['Mozilla/4.0 (Windows XP 5.1) Java/1.6.0_26']}), None)
        d._connectTimeout = 10
        d.addCallback(handleResponse, output, url)
        d.addErrback(handleError)
        return d

    if __name__ == '__main__':
        semaphore = twisted.internet.defer.DeferredSemaphore(500)
        dl = list()
        ipset = set()
        queryset = set(['http://www.google.com','http://www.google1.com','http://www.google2.com', "up to 2500 sites"])
        filemap = {}
        for q in queryset:
            fpos = q.split('http://')[1].split(':')[0]
            dl.append(semaphore.run(getPage, q, filemap[fpos]))
        dl = twisted.internet.defer.DeferredList(dl)
        dl.addCallbacks(lambda x: reactor.stop(), handleError)
        reactor.run()
        for k in filemap:
            filemap[k].close()

Thanks. Jeppo

1 answer

There are at least two problems with your timeout.

First, the only timeout you set is _connectTimeout, and you set it on the Deferred returned by Agent.request. On a Deferred this attribute is meaningless: nothing in the Agent implementation, nor any other part of Twisted, will respect it. I think you meant to set this attribute on the Agent instance instead, where it would have an effect. However, it is a private attribute that is not intended to be set directly. Instead, you should pass connectTimeout=10 to the Agent initializer.

Second, this timeout only affects TCP connection setup. Setting it to 10 means that if the TCP connection to the HTTP server for a particular URL cannot be established in under 10 seconds, the request attempt fails with a timeout error. If the connection is established in under 10 seconds, the timeout no longer matters. If the server takes 10 hours to send you a response, Agent will sit there and wait 10 hours. You need an additional timeout: one that covers the whole request.

This can be implemented separately using reactor.callLater and possibly Deferred.cancel. For instance:

    ...
    d = agent.request(...)
    timeoutCall = reactor.callLater(60, d.cancel)
    def completed(passthrough):
        if timeoutCall.active():
            timeoutCall.cancel()
        return passthrough
    d.addBoth(completed)
    ...