How to implement concurrency in Scrapy (Python) middleware

Edit 2

Second approach: for now I have given up on multiple browser instances and adjusted the Scrapy settings so that it does not issue concurrent requests. It is slow but stable. I have opened a bounty. Can anyone help make this work concurrently? If I configure Scrapy to run requests concurrently, I get segmentation faults.
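For reference, "not issuing concurrent requests" here just means throttling Scrapy in settings.py, roughly like this (these are the standard Scrapy setting names; the values shown are only illustrative):

    # settings.py -- illustrative values, standard Scrapy setting names
    CONCURRENT_REQUESTS = 1
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    DOWNLOAD_DELAY = 0.5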

    import os

    from scrapy.responsetypes import responsetypes

    # `webkit` (the WebkitBrowser wrapper), `killgremlins` and the PROXY_* constants
    # come from project-specific modules and settings.

    class WebkitDownloader(object):
        def __init__(self):
            os.environ["DISPLAY"] = ":99"   # virtual X display (e.g. Xvfb)
            self.proxyAddress = "a:b@" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)  # user:pass placeholder

        def process_response(self, request, response, spider):
            self.request = request
            self.response = response
            if 'cached' not in response.flags:
                # one fresh browser instance per response: slow, but no segfaults
                webkitBrowser = webkit.WebkitBrowser(proxy=self.proxyAddress, gui=False,
                                                     timeout=0.5, delay=0.5,
                                                     forbidden_extensions=['js', 'css', 'swf', 'pdf',
                                                                           'doc', 'xls', 'ods', 'odt'])
                webkitBrowser.get(html=response.body, num_retries=0)
                html = webkitBrowser.current_html()
                respcls = responsetypes.from_args(headers=response.headers, url=response.url)
                kwargs = dict(cls=respcls, body=killgremlins(html))
                response = response.replace(**kwargs)
                webkitBrowser.setPage(None)
                del webkitBrowser
            return response

Edit:

I tried to answer my own question in the meantime and implemented a queue, but for some reason it does not run asynchronously. Basically, while webkitBrowser.get(html=response.body, num_retries=0) is busy, the spider is blocked until the method completes. New requests are not assigned to the remaining free instances in self.queue.

Can someone point me in the right direction to make this work?

    class WebkitDownloader(object):
        def __init__(self):
            proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
            self.queue = list()
            for i in range(8):
                self.queue.append(webkit.WebkitBrowser(proxy=proxyAddress, gui=True,
                                                       timeout=0.5, delay=5.5,
                                                       forbidden_extensions=['js', 'css', 'swf', 'pdf',
                                                                             'doc', 'xls', 'ods', 'odt']))

        def process_response(self, request, response, spider):
            # find a browser instance that is not busy
            i = 0
            for webkitBrowser in self.queue:
                i += 1
                if webkitBrowser.status == "WAITING":
                    break
            webkitBrowser = self.queue[i]

            if webkitBrowser.status == "WAITING":
                # load webpage
                print "added to queue: " + str(i)
                webkitBrowser.get(html=response.body, num_retries=0)
                webkitBrowser.scrapyResponse = response

            while webkitBrowser.status == "PROCESSING":
                print "waiting for queue: " + str(i)

            if webkitBrowser.status == "DONE":
                print "fetched from queue: " + str(i)
                #response = webkitBrowser.scrapyResponse
                html = webkitBrowser.current_html()
                respcls = responsetypes.from_args(headers=response.headers, url=response.url)
                kwargs = dict(cls=respcls, body=killgremlins(html))
                #response = response.replace(**kwargs)
                webkitBrowser.status = "WAITING"

            return response

I am using WebKit in a Scrapy middleware to render JavaScript. Currently Scrapy is configured to process 1 request at a time (no concurrency).

I would like to use concurrency (for example, 8 requests at a time), but then I need to make sure that the 8 instances of WebkitBrowser() receive requests based on their individual processing state (a new request should be dispatched to an instance as soon as its WebkitBrowser.get() has finished and it is ready to receive the next request).

How can I achieve this using Python? This is my current middleware:

    class WebkitDownloader(object):
        def __init__(self):
            proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
            # a single browser instance shared by all responses
            self.w = webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=0.5,
                                          forbidden_extensions=['js', 'css', 'swf', 'pdf',
                                                                'doc', 'xls', 'ods', 'odt'])

        def process_response(self, request, response, spider):
            if ".pdf" not in response.url:
                # load webpage
                self.w.get(html=response.body, num_retries=0)
                html = self.w.current_html()
                respcls = responsetypes.from_args(headers=response.headers, url=response.url)
                kwargs = dict(cls=respcls, body=killgremlins(html))
                response = response.replace(**kwargs)
            return response
2 answers

I can't address everything in your question because I don't know what Scrapy is and I don't understand what would cause the segfault, but I think I can answer one part: why Scrapy blocks when webkitBrowser.get runs.

I do not see anything in your "queue" example that would give you parallelism. You would normally use either the threading module or multiprocessing so that several things can run in parallel. Instead of just calling webkitBrowser.get, I suspect you could run it in a thread. Fetching web pages is a case where Python threads should work quite well: Python cannot execute two CPU-intensive tasks at the same time (because of the GIL), but it can wait for responses from web servers in parallel.
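As a tiny self-contained illustration of that point (not Scrapy-specific; the sleep is only a stand-in for waiting on a web server), the three simulated fetches below overlap and finish in about one second instead of three:

    import threading
    import time

    def blocking_fetch(n):
        time.sleep(1)                     # stands in for waiting on a web server response
        print("fetch %d done" % n)

    threads = [threading.Thread(target=blocking_fetch, args=(n,)) for n in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                          # all three waits happen in parallel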

Here is a recent SO Q/A with sample code that might help.

Here is how to get started: create a Queue. Define a function that takes this queue as an argument, fetches a web page, and puts the response into the queue. In the main program, after all the fetching threads have been started, enter a while True: loop: check the queue and process the next entry, or time.sleep(.1) if it is empty.
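A minimal sketch of that pattern, using only the standard library; fetch_page is a placeholder for whatever blocking fetch/render call you actually use (for example webkitBrowser.get plus current_html):

    import threading
    import time
    try:
        import queue              # Python 3
    except ImportError:
        import Queue as queue     # Python 2

    def fetch_page(url):
        time.sleep(1)             # placeholder for the real blocking fetch/render call
        return "<html>rendered %s</html>" % url

    def worker(url, out_queue):
        # fetch one page and put the result on the shared queue
        out_queue.put((url, fetch_page(url)))

    urls = ["http://example.com/%d" % i for i in range(8)]
    out_queue = queue.Queue()

    threads = [threading.Thread(target=worker, args=(u, out_queue)) for u in urls]
    for t in threads:
        t.start()

    done = 0
    while done < len(urls):       # main loop: drain the queue as results arrive
        try:
            url, html = out_queue.get_nowait()
            print("got %d bytes for %s" % (len(html), url))
            done += 1
        except queue.Empty:
            time.sleep(.1)        # nothing ready yet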


I know this is an old question, but I had a similar one myself and hope the information I came across helps others with the same question:

  • If scrapyjs + splash works for you (given that you are using a WebKit browser, it probably will, since Splash is WebKit-based), this is probably the easiest solution; see the configuration sketch after this list;

  • If option 1 does not work for you, you can run multiple spiders at the same time with scrapyd, or do multiprocessing with Scrapy;

  • Depending on whether your browser rendering is mostly waiting (for pages to render), CPU-intensive, or I/O-intensive, you may want to use non-blocking sleep with Twisted, multithreading, or multiprocessing. For the latter, the value of sticking with Scrapy decreases, and you may want to hack together a simple scraper (for example, the web crawler written by A. Jesse Jiryu Davis and Guido van Rossum) or create your own.
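For option 1, the wiring looks roughly like this with the scrapy-splash package (the successor of scrapyjs); the URL assumes a local Splash instance, for example one started with docker run -p 8050:8050 scrapinghub/splash:

    # settings.py
    SPLASH_URL = 'http://localhost:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    # in the spider: rendering happens inside the Splash service, so Scrapy's
    # normal concurrency settings (CONCURRENT_REQUESTS etc.) apply again
    from scrapy_splash import SplashRequest

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})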

