Running multiple spiders in Scrapy

  • In Scrapy, for example, I have two URLs that contain different HTML. I want to write a separate spider for each one and run both spiders at the same time. Is it possible in Scrapy to run several spiders at once?

  • If I am running several spiders, how do I schedule them to run every 6 hours (probably like a cron job)?

I have no idea how to do the above; can you suggest how to accomplish these things, with an example?

Thanks in advance.

+7
4 answers

It would probably be easiest to run the two Scrapy scripts at once from the OS level. Both of them should be able to save to the same database. Create a shell script that invokes both Scrapy scripts at the same time:

scrapy runspider foo & scrapy runspider bar 

Be sure to make this script executable with chmod +x script_name
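
Put into a file, the whole script might look roughly like this (a minimal sketch; foo and bar are placeholders for your own spider files, so adjust them to your project):

 #!/bin/sh
 # run both spiders in parallel and wait for both to finish
 scrapy runspider foo &
 scrapy runspider bar &
 wait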

To schedule a cronjob every 6 hours, enter crontab -e in your terminal and edit the file as follows:

 0 */6 * * * path/to/shell/script_name >> path/to/file.log 

The first field is minutes, then hours, and so on; an asterisk is a wildcard that matches any value. With 0 in the minutes field and */6 in the hours field, the script runs at minute 0 of every hour divisible by 6, that is, every six hours.

+2

You should use scrapyd to handle multiple crawlers: http://doc.scrapy.org/en/latest/topics/scrapyd.html
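
Roughly, the workflow looks like this (a sketch only, assuming scrapyd is running on its default port 6800 and your project is deployed under the hypothetical name myproject with spiders named spider1 and spider2):

 # start the scrapyd server (in a separate terminal); it listens on localhost:6800 by default
 scrapyd

 # from the directory containing scrapy.cfg, deploy the project (scrapyd-deploy comes from the scrapyd-client package)
 scrapyd-deploy

 # schedule each spider over scrapyd's HTTP API
 curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider1
 curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2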

+2

You can try using CrawlerProcess

 from scrapy.utils.project import get_project_settings
 from scrapy.crawler import CrawlerProcess
 from myproject.spiders import spider1, spider2

 # Spider1 and Spider2 stand in for your own spider classes
 process = CrawlerProcess(get_project_settings())
 process.crawl(spider1.Spider1)  # pass the spider class (or its name), not an instance
 process.crawl(spider2.Spider2)
 process.start()

If you want to see the full crawl log, set LOG_FILE in settings.py :

 LOG_FILE = "logs/mylog.log" 
+1

Here is code that lets you run several spiders in the same Scrapy process. Save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3 and it works):

 from scrapy.utils.project import get_project_settings
 from scrapy.crawler import CrawlerProcess

 setting = get_project_settings()
 process = CrawlerProcess(setting)

 for spider_name in process.spiders.list():
     print("Running spider %s" % (spider_name))
     process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument used in your spiders
 process.start()

You can then schedule this Python program to run as a cronjob.
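
For instance, a crontab entry along the following lines would run it every 6 hours (the file name run_all_spiders.py, the python path, and the project path are assumptions; adjust them for your setup). Changing into the project directory first matters because get_project_settings() needs to find scrapy.cfg:

 # hypothetical crontab entry: run the script above every 6 hours
 0 */6 * * * cd /path/to/your/project && /usr/bin/python run_all_spiders.py >> /path/to/crawl.log 2>&1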

0
