Scrapy Parallel or Distributed Crawls

I would like to use Scrapy to crawl fairly large sites. In some cases I will already have the links to scrape, and in others I will need to extract (crawl) them. I will also need to access a database twice while running: once to determine whether a URL needs to be crawled (spider middleware) and once to store the extracted information (item pipeline). Ideally, I could run parallel or distributed crawls to speed things up. What is the recommended way to run a parallel or distributed crawl with Scrapy?
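For reference, the two database touch points described above could be wired up roughly as in the sketch below. This is only an illustration: sqlite3, the table layout, and the class names are placeholders, not part of Scrapy, and the components would be registered via SPIDER_MIDDLEWARES and ITEM_PIPELINES in settings.py.

    # Hypothetical sketch of the two DB touch points; sqlite3 and the
    # table/column names are stand-ins for whatever database you use.
    import sqlite3

    import scrapy


    class SeenUrlSpiderMiddleware:
        """Spider middleware: drop requests whose URL was already crawled."""

        def __init__(self):
            self.db = sqlite3.connect("crawl_state.db")
            self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

        def process_spider_output(self, response, result, spider):
            for request_or_item in result:
                if isinstance(request_or_item, scrapy.Request):
                    row = self.db.execute(
                        "SELECT 1 FROM seen WHERE url = ?", (request_or_item.url,)
                    ).fetchone()
                    if row:
                        continue  # URL does not need crawling again
                yield request_or_item


    class StoreItemPipeline:
        """Item pipeline: persist every scraped item."""

        def open_spider(self, spider):
            self.db = sqlite3.connect("items.db")
            self.db.execute("CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)")

        def process_item(self, item, spider):
            # Assumes the item has 'url' and 'title' fields; adapt to your schema.
            self.db.execute(
                "INSERT INTO items (url, title) VALUES (?, ?)",
                (item.get("url"), item.get("title")),
            )
            self.db.commit()
            return item

        def close_spider(self, spider):
            self.db.close()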

+4
2 answers

You should check out scrapy_redis.

It is very simple to implement. Your scheduler and duplicate filter will be backed by a Redis queue, so all spider processes share the same queue and work at the same time, which should speed up your crawl.
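For illustration, a minimal settings.py sketch based on the scrapy-redis documentation might look like the following; the Redis address and the spider and key names are placeholder assumptions, so check the project's README for the current option names.

    # settings.py
    # Store the request queue in Redis so several processes can share it.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Share one duplicate filter between all spider processes.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Keep the queues in Redis between runs (allows pausing/resuming).
    SCHEDULER_PERSIST = True
    # Placeholder address; point this at your own Redis instance.
    REDIS_URL = "redis://localhost:6379"

    # myspider.py
    from scrapy_redis.spiders import RedisSpider


    class MySpider(RedisSpider):
        """Reads its start URLs from a Redis list instead of start_urls."""
        name = "myspider"
        redis_key = "myspider:start_urls"  # placeholder key name

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

You can then start the same spider in several processes (or on several machines pointed at the same Redis) and feed URLs into the shared queue with something like redis-cli lpush myspider:start_urls http://example.com.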

Hope this helps.

0

The Scrapy Cluster documentation contains a page listing many of the existing Scrapy-based solutions for distributed crawling.

0
