I would like to use Scrapy to crawl fairly large websites. In some cases I will already have the links to scrape, and in others I will need to extract (crawl) them myself. I will also need to access a database twice while running: once to determine whether a URL needs to be scraped (spider middleware) and once to store the extracted information (item pipeline). Ideally, I would be able to run concurrent or distributed crawls to speed things up. What is the recommended way to run a concurrent or distributed crawl with Scrapy?
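To make the setup concrete, here is a rough sketch of what I have in mind. The spider middleware and item pipeline hooks are standard Scrapy extension points; the `db` helper (`url_needs_scraping`, `store_item`) is just a placeholder for my own database layer, not an existing API:

```python
import scrapy


class UrlFilterSpiderMiddleware:
    """Spider middleware: drop requests for URLs the database says
    do not need to be scraped (first of the two DB accesses)."""

    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            if isinstance(item_or_request, scrapy.Request):
                # spider.db is a hypothetical database helper I would attach to the spider
                if not spider.db.url_needs_scraping(item_or_request.url):
                    continue  # skip URLs that are already done
            yield item_or_request


class DatabaseStoragePipeline:
    """Item pipeline: write each extracted item back to the database
    (second DB access)."""

    def process_item(self, item, spider):
        spider.db.store_item(dict(item))  # hypothetical helper
        return item
```

Both would be enabled the usual way via the `SPIDER_MIDDLEWARES` and `ITEM_PIPELINES` settings. My question is how to scale this across concurrent or distributed crawls.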