Running multiple spiders in Scrapy

  • In Scrapy, for example, I have two URLs that contain different HTML. I want to write a separate spider for each one and run both spiders at the same time. Is it possible in Scrapy to run several spiders at once?

  • If I am running several spiders, how do I schedule them to run every 6 hours (probably like a cron job)?

I have no idea how to do the above; can you suggest how to accomplish these things, with an example?

Thanks in advance.

+7
4 answers

It would probably be easiest to run the two Scrapy scripts at once from the OS level. Both of them should be able to save to the same database. Create a shell script that invokes both Scrapy scripts at the same time:

scrapy runspider foo & scrapy runspider bar 

Be sure to make this script executable with chmod +x script_name
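
Put into a file, the whole script might look roughly like this (a minimal sketch; foo and bar are placeholders for your own spider files, so adjust them to your project):

 #!/bin/sh
 # run both spiders in parallel and wait for both to finish
 scrapy runspider foo &
 scrapy runspider bar &
 wait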

To schedule a cronjob every 6 hours, enter crontab -e in your terminal and edit the file as follows:

 0 */6 * * * path/to/shell/script_name >> path/to/file.log 

The first field is minutes, then hours, and so on; an asterisk is a wildcard that matches any value. With 0 in the minutes field and */6 in the hours field, the script runs at minute 0 of every hour divisible by 6, that is, every six hours.

+2

You should use scrapyd to handle multiple crawlers: http://doc.scrapy.org/en/latest/topics/scrapyd.html
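
Roughly, the workflow looks like this (a sketch only, assuming scrapyd is running on its default port 6800 and your project is deployed under the hypothetical name myproject with spiders named spider1 and spider2):

 # start the scrapyd server (in a separate terminal); it listens on localhost:6800 by default
 scrapyd

 # from the directory containing scrapy.cfg, deploy the project (scrapyd-deploy comes from the scrapyd-client package)
 scrapyd-deploy

 # schedule each spider over scrapyd's HTTP API
 curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider1
 curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2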

+2

You can try using CrawlerProcess

 from scrapy.utils.project import get_project_settings
 from scrapy.crawler import CrawlerProcess
 from myproject.spiders import spider1, spider2

 # Spider1 and Spider2 stand in for your own spider classes
 process = CrawlerProcess(get_project_settings())
 process.crawl(spider1.Spider1)  # pass the spider class (or its name), not an instance
 process.crawl(spider2.Spider2)
 process.start()

If you want to see the full crawl log, set LOG_FILE in settings.py :

 LOG_FILE = "logs/mylog.log" 
+1

Here is code that lets you run several spiders in the same Scrapy process. Save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3 and it works):

 from scrapy.utils.project import get_project_settings
 from scrapy.crawler import CrawlerProcess

 setting = get_project_settings()
 process = CrawlerProcess(setting)

 for spider_name in process.spiders.list():
     print("Running spider %s" % (spider_name))
     process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument used in your spiders
 process.start()

You can then schedule this Python program to run as a cronjob.
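
For instance, a crontab entry along the following lines would run it every 6 hours (the file name run_all_spiders.py, the python path, and the project path are assumptions; adjust them for your setup). Changing into the project directory first matters because get_project_settings() needs to find scrapy.cfg:

 # hypothetical crontab entry: run the script above every 6 hours
 0 */6 * * * cd /path/to/your/project && /usr/bin/python run_all_spiders.py >> /path/to/crawl.log 2>&1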

0
