Edit 2
Second approach: for now I have given up on running multiple instances and configured the Scrapy settings so that no simultaneous requests are made (see the settings sketch after the code below). It is slow, but stable. I have started a bounty for whoever can help make this work concurrently: as soon as I configure Scrapy to run requests concurrently, I get segmentation faults.
    class WebkitDownloader(object):

        def __init__(self):
            os.environ["DISPLAY"] = ":99"
            self.proxyAddress = "a:b@" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)

        def process_response(self, request, response, spider):
            self.request = request
            self.response = response
            if 'cached' not in response.flags:
                webkitBrowser = webkit.WebkitBrowser(proxy=self.proxyAddress, gui=False,
                                                     timeout=0.5, delay=0.5,
                                                     forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])
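For reference, "no simultaneous requests" amounts to something like the following in the project's settings.py. CONCURRENT_REQUESTS is a standard Scrapy setting; the value is the only project-specific part:

    # settings.py -- throttle Scrapy to a single request at a time. Slow, but it
    # avoids the segmentation faults that appear as soon as rendering runs
    # concurrently.
    CONCURRENT_REQUESTS = 1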
Edit:
In the meantime I have tried to answer my own question and implemented a queue, but for some reason it does not run asynchronously. Basically, while webkitBrowser.get(html=response.body, num_retries=0) is busy, Scrapy is blocked until the method completes; new responses are not dispatched to the remaining free instances in self.queue. (One direction I am considering is sketched after the code below.)
Can someone point me in the right direction to make this work?
    class WebkitDownloader(object):

        def __init__(self):
            proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
            self.queue = list()
            for i in range(8):
                self.queue.append(webkit.WebkitBrowser(proxy=proxyAddress, gui=True,
                                                       timeout=0.5, delay=5.5,
                                                       forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt']))

        def process_response(self, request, response, spider):
            # Find the first instance that is not busy. (An earlier version
            # indexed self.queue[i] after the loop, which is off by one: the
            # loop variable itself is already the matching instance.)
            for webkitBrowser in self.queue:
                if webkitBrowser.status == "WAITING":
                    break
            if webkitBrowser.status == "WAITING":
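One direction I am considering (an untested sketch, not working code): move the blocking webkitBrowser.get() call into Twisted's thread pool with deferToThread(), so the reactor stays free to dispatch further responses to the idle browsers in the pool. This assumes that Scrapy's middleware chain, which is built on Twisted Deferreds, waits on a returned Deferred, and that a WebkitBrowser instance can be driven off the main thread at all; given the segmentation faults above, the latter is doubtful. rendered_html() is a hypothetical accessor, standing in for however WebkitBrowser actually exposes the rendered page:

    import Queue  # stdlib thread-safe queue (renamed to queue in Python 3)

    from twisted.internet import threads

    class WebkitDownloader(object):

        def __init__(self):
            proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
            # Thread-safe pool of idle browsers: check one out per response,
            # hand it back once rendering is done.
            self.idle = Queue.Queue()
            for _ in range(8):
                self.idle.put(webkit.WebkitBrowser(proxy=proxyAddress, gui=True,
                                                   timeout=0.5, delay=0.5,
                                                   forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt']))

        def _render(self, response):
            # Runs in a worker thread, so blocking here does not stall the reactor.
            browser = self.idle.get()
            try:
                browser.get(html=response.body, num_retries=0)
                # rendered_html() is hypothetical -- see the note above.
                return response.replace(body=browser.rendered_html())
            finally:
                self.idle.put(browser)

        def process_response(self, request, response, spider):
            if 'cached' in response.flags:
                return response
            # deferToThread() runs _render() in Twisted's thread pool and
            # returns a Deferred that fires with the rendered response.
            return threads.deferToThread(self._render, response)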
I am using WebKit in a Scrapy downloader middleware to render JavaScript. Currently, Scrapy is configured to process one request at a time (no concurrency).
I would like to use concurrency (for example, 8 requests at a time), but then I need to make sure that the 8 instances of WebkitBrowser() receive requests according to their individual processing state: a new request should be handed to an instance as soon as its WebkitBrowser.get() has finished and it is ready for the next one. (A settings-side note follows the middleware code below.)
How can I achieve this using Python? This is my current middleware:
    class WebkitDownloader(object):

        def __init__(self):
            proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
            self.w = webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=0.5,
                                          forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])

        def process_response(self, request, response, spider):
            if ".pdf" not in response.url:
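For completeness, the Scrapy side of the concurrency is just a settings change; on its own it achieves nothing while process_response() still blocks on a single browser, which is what the thread-pool sketch under the edit above tries to address:

    # settings.py -- let Scrapy keep 8 requests in flight so the middleware can
    # actually see more than one response at a time.
    CONCURRENT_REQUESTS = 8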