I'm new to Twisted, and I'm trying to write an asynchronous client that fetches a set of URLs and stores each response in a separate file. When I run the program against a small number of servers, say 10, the reactor loop finishes correctly and the program exits. But when I run it against, for example, the Alexa top 2500, the program starts fetching URLs and then never terminates. I set a timeout, but it has no effect; I suspect there are open sockets that never trigger either a success or an error callback. My goal is that once every page has been fetched, or the timeout for each connection has expired, the program shuts down and closes all file descriptors.
Sorry, the indentation of the code was lost during copy and paste; I have now checked and fixed it. The code is minimal, just to give an example. Please note that my problem is that the reactor does not stop when I run the program against a large number of sites.
#!/usr/bin/env python
from pprint import pformat
from twisted.internet import reactor
import twisted.internet.defer
import sys
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent
from twisted.web.http_headers import Headers

class PrinterClient(Protocol):
    def __init__(self, whenFinished, output):
        self.whenFinished = whenFinished
        self.output = output

    def dataReceived(self, bytes):
        #print '##### Received #####\n%s' % (bytes,)
        self.output.write('%s' % (bytes,))

    def connectionLost(self, reason):
        print 'Finished:', reason.getErrorMessage()
        self.output.write('Finished: %s \n' % (reason.getErrorMessage()))
        self.output.write('#########end########%s\n' % (reason.getErrorMessage()))
        self.whenFinished.callback(None)

def handleResponse(r, output, url):
    output.write('############start############\n')
    output.write('%s\n' % (url,))
    #print "version=%s\ncode=%s\nphrase='%s'" % (r.version, r.code, r.phrase)
    output.write("version=%s\ncode=%s\nphrase='%s'"
                 % (r.version, r.code, r.phrase))
    for k, v in r.headers.getAllRawHeaders():
        #print "%s: %s" % (k, '\n '.join(v))
        output.write("%s: %s\n" % (k, '\n '.join(v)))
    whenFinished = twisted.internet.defer.Deferred()
    r.deliverBody(PrinterClient(whenFinished, output))
    return whenFinished

def handleError(reason):
    print reason
    #reason.printTraceback()
    #reactor.stop()

def getPage(url, output):
    print "Requesting %s" % (url,)
    d = Agent(reactor).request(
        'GET', url,
        Headers({'User-Agent': ['Mozilla/4.0 (Windows XP 5.1) Java/1.6.0_26']}),
        None)
    d._connectTimeout = 10  # this timeout does not seem to have any effect
    d.addCallback(handleResponse, output, url)
    d.addErrback(handleError)
    return d

if __name__ == '__main__':
    semaphore = twisted.internet.defer.DeferredSemaphore(500)
    dl = list()
    ipset = set()
    queryset = set(['http://www.google.com', 'http://www.google1.com',
                    'http://www.google2.com'])  # up to 2500 sites
    filemap = {}
    for q in queryset:
        fpos = q.split('http://')[1].split(':')[0]
        if fpos not in filemap:
            # one output file per host (this step was elided in the paste)
            filemap[fpos] = open(fpos, 'w')
        dl.append(semaphore.run(getPage, q, filemap[fpos]))
    dl = twisted.internet.defer.DeferredList(dl)
    dl.addCallbacks(lambda x: reactor.stop(), handleError)
    reactor.run()
    for k in filemap:
        filemap[k].close()
Thanks. Jeppo