Scrapy Parallel or Distributed Crawls

I would like to use Scrapy to crawl fairly large sites. In some cases I will already have the links to scrape, and in others I will need to extract (crawl) them. I will also need to access a database twice while running: once to determine whether a URL needs to be crawled (spider middleware) and once to store the extracted information (item pipeline). Ideally, I could run parallel or distributed crawls to speed things up. What is the recommended way to run a parallel or distributed crawl with Scrapy?
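For reference, the two database touch points described above could be wired up roughly as in the sketch below. This is only an illustration: sqlite3, the table layout, and the class names are placeholders, not part of Scrapy, and the components would be registered via SPIDER_MIDDLEWARES and ITEM_PIPELINES in settings.py.

    # Hypothetical sketch of the two DB touch points; sqlite3 and the
    # table/column names are stand-ins for whatever database you use.
    import sqlite3

    import scrapy


    class SeenUrlSpiderMiddleware:
        """Spider middleware: drop requests whose URL was already crawled."""

        def __init__(self):
            self.db = sqlite3.connect("crawl_state.db")
            self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

        def process_spider_output(self, response, result, spider):
            for request_or_item in result:
                if isinstance(request_or_item, scrapy.Request):
                    row = self.db.execute(
                        "SELECT 1 FROM seen WHERE url = ?", (request_or_item.url,)
                    ).fetchone()
                    if row:
                        continue  # URL does not need crawling again
                yield request_or_item


    class StoreItemPipeline:
        """Item pipeline: persist every scraped item."""

        def open_spider(self, spider):
            self.db = sqlite3.connect("items.db")
            self.db.execute("CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)")

        def process_item(self, item, spider):
            # Assumes the item has 'url' and 'title' fields; adapt to your schema.
            self.db.execute(
                "INSERT INTO items (url, title) VALUES (?, ?)",
                (item.get("url"), item.get("title")),
            )
            self.db.commit()
            return item

        def close_spider(self, spider):
            self.db.close()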

+4
2 answers

You should check out scrapy_redis.

It is very simple to implement. Your scheduler and duplicate filter will be backed by a Redis queue, so all spider processes share the same queue and work at the same time, which should speed up your crawl.
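For illustration, a minimal settings.py sketch based on the scrapy-redis documentation might look like the following; the Redis address and the spider and key names are placeholder assumptions, so check the project's README for the current option names.

    # settings.py
    # Store the request queue in Redis so several processes can share it.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Share one duplicate filter between all spider processes.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Keep the queues in Redis between runs (allows pausing/resuming).
    SCHEDULER_PERSIST = True
    # Placeholder address; point this at your own Redis instance.
    REDIS_URL = "redis://localhost:6379"

    # myspider.py
    from scrapy_redis.spiders import RedisSpider


    class MySpider(RedisSpider):
        """Reads its start URLs from a Redis list instead of start_urls."""
        name = "myspider"
        redis_key = "myspider:start_urls"  # placeholder key name

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

You can then start the same spider in several processes (or on several machines pointed at the same Redis) and feed URLs into the shared queue with something like redis-cli lpush myspider:start_urls http://example.com.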

Hope this helps.

0

The Scrapy Cluster documentation contains a page listing many of the existing Scrapy-based solutions for distributed crawling.

0
